Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Retries as a first-class concept

It is a Tuesday at PaySetu and Meera, an SRE on the payouts team, is reading a postmortem written by a junior engineer. The postmortem describes an outage where the merchant-credit API's vendor returned a stream of 502 Bad Gateway responses for 38 minutes during a switch-over. The settlement workflow's activity called bank_credit(...) and got the 502 back. The engineer's code retried three times, each retry 100 ms after the last. After three attempts, it gave up and marked the workflow failed. The vendor recovered at minute 39. Forty-one workflows had already been marked failed; their settlements had to be reconciled manually. Meera leaves a single comment on the postmortem: "why is the retry policy in the activity body and not on the activity registration?" That comment is the entire chapter.

In a workflow engine, retries are not a thing the function does — they are a thing the engine applies to the function. The retry behaviour is declared at activity-registration time as a RetryPolicy (Temporal), Retry block (AWS Step Functions), or policy.retry() (DBOS). The engine reads the policy, dispatches the activity, observes the outcome, and decides — outside your code — whether to retry, how long to wait, and when to give up. Promoting retries from caller code to engine config is one of those small moves that looks like a stylistic preference and is actually a structural one.

A first-class retry is a declarative policy attached to an activity, not a try/except inside it. The workflow engine — Temporal, Step Functions, Cadence, DBOS — owns the retry loop, persists every attempt to the workflow history, and exposes the policy as inspectable, versionable configuration. The benefits compound: retries become idempotent across worker crashes, retry behaviour becomes auditable, and the workflow code stops carrying transient-failure paranoia.

What "first-class" means here — and why it matters

In normal Python, a retry looks like this:

for attempt in range(3):
    try:
        result = bank_credit(merchant_id, amount)
        break
    except TransientError:
        time.sleep(0.1 * (2 ** attempt))
else:
    raise BankCreditFailed()

Five concerns are tangled in those eight lines: what to retry, when to retry, how many times, which exceptions count as transient, and what to do on permanent failure. Each of those is a policy decision. Each of those changes when the operating environment changes — when the bank vendor's recovery time goes from 200 ms to 38 minutes, when a new payment partner shows up with a different SLA, when the team decides that 503 should be retried but 429 should back off harder. Every change is a code change, a code review, a deploy, and coordination across the four teams that import this helper.

A first-class retry pulls all five concerns out of the function and into a config object:

from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class PayoutWorkflow:
    @workflow.run
    async def run(self, merchant_id, amount):
        return await workflow.execute_activity(
            bank_credit,
            args=[merchant_id, amount],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(milliseconds=500),
                maximum_interval=timedelta(seconds=60),
                backoff_coefficient=2.0,
                maximum_attempts=12,
                non_retryable_error_types=["AccountFrozenError"],
            ),
        )

The function bank_credit itself is just def bank_credit(mid, amt): ... — no retry loop, no try/except, no time.sleep. The engine reads the RetryPolicy, calls bank_credit, and if it raises something that is not in non_retryable_error_types, the engine waits initial_interval × backoff_coefficient^(attempt - 1) (capped at maximum_interval), records the attempt to history, and tries again. Why this matters: the retry policy is now a declared, versioned, auditable thing. It shows up in the workflow's history. The SRE on call at 03:14 can read the policy without reading the function body. A platform team can write a linter that flags activities whose maximum_attempts is undefined. A traffic-engineering team can change backoff_coefficient from 2.0 to 1.6 across all activities by editing the registration, not the implementation.
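
As a sketch of that last point, here is a hypothetical lint pass over a registry of activity policies; the activity names and dict shape are made up, and a real platform would read registrations from the engine's metadata rather than a literal dict:

registrations = {
    "bank_credit":  {"maximum_attempts": 12,   "expiration_interval": None},
    "cdn_purge":    {"maximum_attempts": None, "expiration_interval": None},   # flagged
    "ledger_write": {"maximum_attempts": None, "expiration_interval": 3600},
}
# Flag any activity whose retries are unbounded in both attempts and total time.
for activity, policy in registrations.items():
    if policy["maximum_attempts"] is None and policy["expiration_interval"] is None:
        print(f"LINT: {activity} has unbounded retries and no overall time cap")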

The shape of this shift — pull the policy out of the function, give it to the runtime — is the same shift that took us from in-process retries to circuit breakers (see /wiki/circuit-breakers-hystrix-sentinel) and from manual rate-limiters to platform-managed ones. The pattern is: the runtime is the right home for cross-cutting concerns.

[Figure: Retry logic embedded vs. retry logic declared. Left panel: a function whose body tangles the retry loop, sleep, and exception classification. Right panel: a one-line activity body plus a separate RetryPolicy object that the engine reads and applies across attempts. Illustrative.]
Same logical behaviour. The right side is a config object the engine reads at dispatch time; the left side is logic only its author understands. Illustrative.

What a retry policy actually contains

The minimum viable retry policy has six fields, and the names are roughly stable across Temporal, Step Functions, Cadence, and DBOS:

  1. initial_interval — how long to wait before the first retry. Typically a few hundred milliseconds for low-latency activities, a few seconds for partner APIs.
  2. backoff_coefficient — the multiplier between successive intervals. 2.0 is the canonical exponential; 1.5 is gentler; 1.0 is constant interval (rarely correct).
  3. maximum_interval — a ceiling on the interval. Without this, a 30-day workflow would eventually retry once a month. Typical caps: 60 seconds for hot-path APIs, 15 minutes for daily batch jobs.
  4. maximum_attempts — total attempts (including the first). Setting this to 0 means "retry forever" and is the right answer for some workflows (a payout will eventually go through; the workflow's job is to wait).
  5. expiration_interval — most engines also let you cap the total wall-clock time spent retrying (Temporal expresses this as the activity's schedule_to_close timeout). Useful when the workflow has a downstream deadline.
  6. non_retryable_error_types — exceptions that should fail the activity immediately. The classic list: authentication errors, validation errors, "this account is frozen", "this idempotency key has already been used with a different payload". Anything that no amount of waiting will fix.

The interaction of these six fields is where engineers get tripped up. Why this is subtle: maximum_attempts=10 with initial_interval=1s and backoff_coefficient=2.0 means the worst-case wait between attempt 9 and 10 is 1s × 2^8 = 256 seconds, and the cumulative wall-clock across the nine waits is 1 + 2 + 4 + ... + 128 + 256 = 511 seconds, about eight and a half minutes (ignoring maximum_interval, which would cap each individual wait). If the activity's start_to_close_timeout is 30 seconds and the workflow's overall deadline is 5 minutes, the policy will never reach attempt 10 — the workflow will time out first. The policy must be designed against the activity's timeout and the workflow's overall deadline together; designing them in isolation produces silent dead code.
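
The arithmetic is easy to check mechanically. A back-of-the-envelope sketch (no maximum_interval cap, no jitter), using the same formula the toy engine later in this chapter implements:

initial, coeff, max_attempts = 1.0, 2.0, 10
# There are max_attempts - 1 waits: one after each failed attempt except the last.
waits = [initial * coeff ** (n - 1) for n in range(1, max_attempts)]
print(waits)        # [1.0, 2.0, 4.0, ..., 256.0]
print(sum(waits))   # 511.0 seconds -- well past a 5-minute workflow deadline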

The retry interval also typically receives jitter — a small random perturbation, usually ±20%, applied to each interval to break synchronised retry storms across many concurrent workflows. Temporal's default jitter is 0.2 (i.e. ±20%); Step Functions adds equivalent decorrelated jitter automatically. Without jitter, a partner API outage that affects 10,000 in-flight workflows produces 10,000 retries at exactly the same instant the API recovers — exactly the thundering-herd failure the engine is trying to prevent. (See /wiki/retries-exponential-backoff-jitter for the underlying mathematics; this chapter is about the promotion of that mathematics to a declared policy.)
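
A small simulation makes the effect concrete; the numbers are illustrative, not a benchmark of any engine:

import random
from collections import Counter

# 10,000 workflows all saw the vendor fail and are waiting out the same 2 s backoff.
N, base, jitter = 10_000, 2.0, 0.2
no_jitter = [base] * N
jittered = [base * (1 + random.uniform(-jitter, jitter)) for _ in range(N)]

def peak_in_100ms_bucket(times):
    return Counter(round(t, 1) for t in times).most_common(1)[0][1]

print(peak_in_100ms_bucket(no_jitter))   # 10000 -- every retry lands in the same instant
print(peak_in_100ms_bucket(jittered))    # roughly 1250 -- spread across an 0.8 s window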

The retry history — every attempt is in the audit log

Because retries are owned by the engine, every single attempt becomes an event in the workflow's history. A failed workflow's history might read:

ActivityScheduled       bank_credit  attempt=1
ActivityFailed          bank_credit  attempt=1  reason=Timeout(30s)
ActivityScheduled       bank_credit  attempt=2  scheduled_at=t+0.5s
ActivityFailed          bank_credit  attempt=2  reason=502 BadGateway
ActivityScheduled       bank_credit  attempt=3  scheduled_at=t+1.0s
ActivityFailed          bank_credit  attempt=3  reason=502 BadGateway
ActivityScheduled       bank_credit  attempt=4  scheduled_at=t+2.0s
ActivityCompleted       bank_credit  attempt=4  result={ok: true, txn: C8821}
WorkflowCompleted       result=DONE

[Figure: Retry timeline with exponential backoff and jitter, recorded into history. Four attempts on a time axis; the gaps grow exponentially (~0.5 s, ~1.0 s, ~2.0 s, each ±20% jitter); attempts 1–3 fail with 502, attempt 4 succeeds; a row of history events (ActivityScheduled, ActivityFailed, ActivityCompleted) mirrors each attempt. Illustrative.]
Every retry, every wait, every outcome lands in the history store. Hand-rolled retry loops give you only the final success or failure — not the attempts that produced it. Illustrative.

Each line is queryable. The SRE during the postmortem can ask: for the workflows that succeeded eventually, what was the median number of retries? — and answer it with a SQL query on the history store. They can ask: which non-retryable error types were thrown most often? They can ask: what fraction of activities never reached their first retry because the workflow timed out at the same moment? These questions are unanswerable when retries live in caller code; the function returns success, the metrics show success, the four retries inside it are invisible to everything outside the process.
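
Here is a sketch of the first question against an in-memory stand-in for the history store; the row shape is made up, and a real deployment would run the equivalent SQL against the engine's history tables:

from statistics import median

# (workflow_id, activity, attempt, event) -- hypothetical flattened history rows
rows = [
    ("wf-1", "bank_credit", 1, "ActivityFailed"),
    ("wf-1", "bank_credit", 2, "ActivityCompleted"),
    ("wf-2", "bank_credit", 1, "ActivityFailed"),
    ("wf-2", "bank_credit", 2, "ActivityFailed"),
    ("wf-2", "bank_credit", 3, "ActivityCompleted"),
    ("wf-3", "bank_credit", 1, "ActivityCompleted"),
]
# Retries for a workflow that eventually succeeded = completing attempt number - 1.
completed = [r for r in rows if r[3] == "ActivityCompleted"]
print(median(r[2] - 1 for r in completed))   # 1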

The same property — every attempt in the history — gives you something subtler: retry idempotency under worker crash. If a worker crashes mid-retry, the engine notices the activity's heartbeat has stopped, and the retry loop continues on a different worker. The new worker reads the history, sees attempt=3 in progress, last heartbeat 12s ago, and resumes the retry from attempt=3 rather than starting over from attempt=1. The retry counter is durable, just like everything else in the workflow. Why this matters: in a hand-rolled retry loop, the retry counter is a local variable. If the process dies mid-loop, the counter is lost, and the loop restarts from zero — which means a vendor that takes 30 seconds to recover gets hammered for 90 seconds (three full retry runs of three attempts each) instead of the budgeted 30. The bug is invisible until you scale up to enough concurrent workflows that worker crashes are no longer rare events.
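
A sketch of that recovery step, written against the toy history format used below; it is illustrative only, not Temporal's actual recovery code:

def resume_attempt(history, activity):
    """Recover the attempt counter from durable history after a worker crash:
    re-run the last scheduled attempt if it never reached a terminal event."""
    scheduled = [ev[1] for ev in history if ev[0] == activity and ev[2] == "scheduled"]
    terminal = [ev[1] for ev in history if ev[0] == activity
                and ev[2] in ("completed", "failed-transient", "failed-permanent")]
    last = max(scheduled, default=0)
    return last if last not in terminal else last + 1

crashed_history = [
    ("bank_credit", 1, "scheduled"), ("bank_credit", 1, "failed-transient"),
    ("bank_credit", 2, "scheduled"), ("bank_credit", 2, "failed-transient"),
    ("bank_credit", 3, "scheduled"),          # worker died during attempt 3
]
print(resume_attempt(crashed_history, "bank_credit"))   # 3 -- not back to 1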

A toy retry-aware activity engine — runnable

The 50-line version below shows the policy-driven dispatch loop. It does not persist anything (real engines do); it does not parallelise (real engines do); but it captures the core: the activity body is small, the policy is a config, and the loop is run by the engine.

import time, random
from dataclasses import dataclass
from typing import Callable, Type, List

@dataclass
class RetryPolicy:
    initial_interval: float = 0.5
    backoff_coefficient: float = 2.0
    maximum_interval: float = 60.0
    maximum_attempts: int = 5
    jitter: float = 0.2
    non_retryable: List[Type[Exception]] = None

    def delay(self, attempt: int) -> float:
        base = min(self.initial_interval * (self.backoff_coefficient ** (attempt - 1)),
                   self.maximum_interval)
        return base * (1 + random.uniform(-self.jitter, self.jitter))

def execute_activity(name: str, fn: Callable, args: tuple, policy: RetryPolicy):
    history = []
    for attempt in range(1, policy.maximum_attempts + 1):
        history.append((name, attempt, "scheduled"))
        try:
            result = fn(*args)
            history.append((name, attempt, "completed", result))
            return result, history
        except Exception as e:
            non_retryable = policy.non_retryable or []
            if any(isinstance(e, t) for t in non_retryable):
                history.append((name, attempt, "failed-permanent", repr(e)))
                raise
            history.append((name, attempt, "failed-transient", repr(e)))
            if attempt == policy.maximum_attempts:
                raise
            wait = policy.delay(attempt)
            history.append((name, attempt, "wait", round(wait, 3)))
            time.sleep(wait)

# Activity body — no retry logic
class TransientError(Exception): pass
class AccountFrozen(Exception): pass

_calls = {"bank_credit": 0}
def bank_credit(mid, amt):
    _calls["bank_credit"] += 1
    if _calls["bank_credit"] < 4:
        raise TransientError("502 BadGateway")
    return {"ok": True, "txn": f"C{random.randint(1000,9999)}"}

policy = RetryPolicy(initial_interval=0.1, maximum_attempts=6,
                     non_retryable=[AccountFrozen])
result, hist = execute_activity("bank_credit", bank_credit, ("M-7731", 4500), policy)
for ev in hist: print(ev)
print("RESULT:", result)

Sample run:

('bank_credit', 1, 'scheduled')
('bank_credit', 1, 'failed-transient', "TransientError('502 BadGateway')")
('bank_credit', 1, 'wait', 0.092)
('bank_credit', 2, 'scheduled')
('bank_credit', 2, 'failed-transient', "TransientError('502 BadGateway')")
('bank_credit', 2, 'wait', 0.218)
('bank_credit', 3, 'scheduled')
('bank_credit', 3, 'failed-transient', "TransientError('502 BadGateway')")
('bank_credit', 3, 'wait', 0.371)
('bank_credit', 4, 'scheduled')
('bank_credit', 4, 'completed', {'ok': True, 'txn': 'C8421'})
RESULT: {'ok': True, 'txn': 'C8421'}

Four attempts, three transient failures, three jittered waits, one success. The activity body knows nothing about retries; the engine knows everything. The history list is the audit log. Now imagine the _calls counter simulating failures is replaced with a real vendor API call and the RetryPolicy is loaded from a config service — that is roughly what a real engine such as Temporal does, plus persistence.

The walkthrough, line by line:

  • RetryPolicy.delay(attempt) computes the exponential interval for attempt and applies jitter. For attempt 1 it returns initial_interval ± 20%; for attempt 5 it returns min(initial_interval × 2^4, maximum_interval) ± 20%.
  • execute_activity is the engine's dispatch loop. It records every state change to history, distinguishes retryable from non-retryable exceptions, and gives up only when maximum_attempts is exhausted.
  • The activity bank_credit is a single business expression with no retry awareness. To swap retry behaviour, change the RetryPolicy argument; the activity does not change.
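
To see the swap in action, re-run the toy example with a different policy object and an untouched bank_credit (resetting _calls just re-arms the simulated 502s):

_calls["bank_credit"] = 0                                # re-arm the simulated failures
patient = RetryPolicy(initial_interval=0.05, backoff_coefficient=1.5,
                      maximum_attempts=20, non_retryable=[AccountFrozen])
result, hist = execute_activity("bank_credit", bank_credit, ("M-7731", 4500), patient)
print(len(hist), "history events; txn:", result["txn"])  # same activity, new retry shape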

Where this shows up in production

CricStream's video-publish workflow has 11 activities. Of those, 7 talk to internal services (low-latency, low-error-rate) and 4 talk to external partners (CDN purge, ID-provider sync, a third-party analytics push, a moderation vendor). The retry policy for the internal 7 is initial_interval=200ms, max_attempts=3. The retry policy for the external 4 is initial_interval=2s, max_attempts=20, max_interval=10min. The split is declared at activity registration; the workflow function itself is identical-looking for all 11. When the moderation vendor goes down for 47 minutes during the IPL final, the four affected activities patiently retry — the longest one logs 14 attempts spaced from 2 seconds to 9 minutes apart — and 96% of them recover without the workflow being failed. The remaining 4% hit max_attempts=20 and are routed to a manual-review queue. Why this works without anyone touching workflow code: the policies are config; the platform team adjusted them once after the previous IPL final's lessons; the application team never had to think about retry mechanics during the incident.
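
Expressed with the toy RetryPolicy from earlier, the split is just two named config objects and a lookup table; the activity names and exact values here are hypothetical:

INTERNAL = RetryPolicy(initial_interval=0.2, maximum_attempts=3)
EXTERNAL = RetryPolicy(initial_interval=2.0, maximum_attempts=20, maximum_interval=600.0)

ACTIVITY_POLICIES = {
    "transcode": INTERNAL, "package_manifest": INTERNAL,       # ... 5 more internal
    "cdn_purge": EXTERNAL, "moderation_check": EXTERNAL,       # ... 2 more external
}
# The workflow dispatches execute_activity(name, fn, args, ACTIVITY_POLICIES[name])
# and never mentions retries; retuning after an incident edits this table, not the workflow.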

PaySetu uses a different retry shape for compensations. The settlement workflow's main path retries with a normal exponential backoff. But the compensation activity — the rollback action that runs on saga failure — uses max_attempts=0 (i.e. retry forever) with a maximum interval as long as 24 hours. The reasoning: a compensation that fails is worse than a compensation that takes a day. Money sitting in the wrong account is a regulatory issue; money taking 12 hours to return is annoying. The retry policy encodes the business priority directly: this compensation will retry until the heat death of the universe if needed. (See /wiki/the-saga-pattern-revisited-in-workflows for how saga compensation works.)

Common confusions

  • "Retries belong in the application code where the engineer can see them" — They do not. Embedding retries in caller code makes them per-callsite, per-engineer, per-team. Promoting them to engine config makes them per-activity, declared in one place, auditable from one place. The engineer who wrote the activity body and the engineer who tunes the retry policy are usually different people; that is the point.
  • "Setting max_attempts=0 is dangerous, it means infinite retries" — In Temporal it does mean infinite retries, and it is exactly right for activities whose business contract is "this will succeed eventually". The danger is not the retry count; the danger is forgetting to set a expiration_interval or start_to_close_timeout, so the workflow has nothing bounding how long it sits.
  • "Retries are the same as redrives" — A retry is automatic, in-policy, owned by the engine. A redrive is a manual replay of a failed workflow, usually triggered by an operator after fixing an upstream bug. Retries handle transient failures within a workflow; redrives handle permanent failures between workflows.
  • "Non-retryable errors should be exceptions; retryable errors should be return values" — They should both be exceptions. The engine distinguishes them by type, not by control-flow shape. Returning a {"error": "frozen"} dict instead of raising loses the engine's ability to apply the retry policy, because the engine sees a successful return and proceeds.
  • "Once an activity is in the non-retryable list it can never be retried" — It can be redriven manually after the operator fixes the cause. The non-retryable list says "the engine should not auto-retry without human intervention", not "this can never run again".
  • "A retry policy with backoff_coefficient=1.0 is just constant retries" — Yes, and that is almost always wrong. Constant-interval retries produce the thundering-herd outage described in /wiki/retries-exponential-backoff-jitter; they are the ancestral failure mode that exponential backoff was invented to fix. Engines accept 1.0 because some activities really do want it (e.g. polling a slow database); 99% of the time, choosing 1.0 is a mis-configuration.

Going deeper

Retry budgets — the next layer up

RetryPolicy controls retries per activity. A retry budget controls retries across all calls to a downstream service. The budget is usually expressed as "no more than 10% of total RPS to service X may be retries". When the budget is exceeded, new retries are dropped or fast-failed, even if the per-activity policy would have allowed another attempt. This is the engine-level analogue of the circuit breaker: retries are good for one workflow, but at scale they can collectively kill the very service they are trying to recover from. Temporal does not ship a built-in retry budget; teams implement it via a coordinator activity that checks a global rate-limiter before allowing the real activity to dispatch.
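
A sketch of the shape such a coordinator check usually takes, as a sliding-window counter; this is entirely hypothetical and not a built-in Temporal feature:

import time

class RetryBudget:
    """Allow retries to a downstream only while retries stay under a fixed
    fraction of all recent calls (hypothetical sliding-window implementation)."""
    def __init__(self, ratio=0.10, window_s=60):
        self.ratio, self.window_s, self.events = ratio, window_s, []

    def record(self, is_retry):
        now = time.time()
        self.events = [(t, r) for t, r in self.events if now - t < self.window_s]
        self.events.append((now, is_retry))

    def allow_retry(self):
        total = len(self.events) or 1
        return sum(1 for _, r in self.events if r) / total < self.ratio

budget = RetryBudget()
for _ in range(100):
    budget.record(is_retry=False)          # normal traffic
print(budget.allow_retry())                # True  -- headroom left
for _ in range(15):
    budget.record(is_retry=True)           # retry burst during an outage
print(budget.allow_retry())                # False -- further retries are fast-failed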

Idempotency and retries — the silent contract

A retry policy assumes the activity is idempotent. If bank_credit(M-7731, 4500) is not idempotent, the engine's third retry might double-credit the merchant. The engine has no way to know. The contract — "if you accept retries, you accept idempotency" — is on the activity author, not the engine. The standard implementation: pass an idempotency key (often the workflow id + activity id, kept stable across attempts so every retry carries the same key) to the downstream service, which dedups on it server-side. (See /wiki/at-least-once-idempotency-in-practice for the patterns; this chapter is about why retry promotion makes idempotency more important, not less.)
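
A minimal sketch of that contract, with a hypothetical downstream that dedups server-side; note the key is built from the workflow and activity identity and deliberately excludes the attempt number, so every retry replays the same key:

processed = {}   # hypothetical server-side dedup table: idempotency_key -> result

def vendor_credit(idempotency_key, merchant_id, amount):
    """Credits the merchant at most once per idempotency key."""
    if idempotency_key in processed:
        return processed[idempotency_key]        # retry: replay the recorded result
    result = {"ok": True, "txn": f"T-{len(processed) + 1}"}
    processed[idempotency_key] = result
    return result

key = "payout-wf-991/bank_credit"                # stable across retries
first = vendor_credit(key, "M-7731", 4500)
retry = vendor_credit(key, "M-7731", 4500)       # the engine's retry after a timeout
print(first == retry, len(processed))            # True 1 -- no double credit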

Retry policy versioning — the operational tail

Like workflow versioning, retry policies are part of the workflow's binding to history. A workflow that started under policy v1 (max_attempts=3) and is mid-execution when policy v2 (max_attempts=12) is deployed will, by default, finish under v1. Some engines let you migrate in-flight workflows to the new policy via a one-shot operator action; most do not, on the grounds that changing the retry behaviour mid-workflow is more dangerous than letting it complete under the old policy. The lesson: tune retry policies aggressively for new workflows, conservatively for in-flight ones, and never assume a deploy retroactively changes behaviour.

When to not retry — the philosophical version

The most common retry mistake is not the policy parameters; it is retrying things that should not be retried at all. Three categories that almost never benefit from retries: validation errors (the input is bad and will be bad on attempt 2), authentication/authorisation errors (the token is wrong and will still be wrong on attempt 2), and contract errors (the API has been deprecated and the next attempt will hit the same 410 Gone). Putting these into non_retryable_error_types is not a defensive measure — it is the primary purpose of the field. Every minute spent retrying a ValidationError is a minute the user is staring at a spinner waiting for the workflow to give up.

Where this leads next

First-class retries are one of three orthogonal axes the workflow engine takes ownership of. The others are timeouts (every activity has a start_to_close, schedule_to_start, and schedule_to_close deadline) and compensations (every successful activity has a paired rollback the engine can run on saga failure). The combination — declared retries, declared timeouts, declared compensations — is what lets the workflow function be the business logic only, with all the resilience concerns lifted into the engine's configuration surface.

The lesson the rest of this section inherits: resilience patterns belong in the runtime, not the function. Once you accept that, the workflow code stops looking defensive and starts looking like business logic — which is what it should have been all along.
