Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Dead-letter queues and retries
MealRush's order-pricing service started failing at 19:42 on a Saturday during the dinner rush. A new tax rule had been merged that morning; one out of every 8,000 orders contained a corner-case PIN code that crashed the pricing function with a KeyError. The consumer's auto-retry kicked in: the bad message was re-delivered, crashed again, re-delivered, crashed again. By 19:48, the consumer group had spent six minutes locked on a single message at the head of partition 17, and the other 99,999 healthy orders behind it were piling up. Karan — the on-call — opened the consumer logs and saw the same KeyError printed 3,400 times. The fix was not "make the function not crash". The fix was a queue for the messages that cannot be processed, and a retry policy for the messages that just need a moment. Without those two pieces, one bad input takes the whole pipeline down.
This chapter is about the two halves of that recipe — retries for transient failures and dead-letter queues (DLQ) for permanent ones. Get the boundary between "transient" and "permanent" wrong and you either drop money on the floor or you build a self-inflicted DDoS against your own database. Get the retry policy wrong and a downstream blip becomes a thundering herd. Get the DLQ wrong and your team ignores a Slack channel called #mealrush-dlq for nine months until somebody finally reads it and finds 41,000 unprocessed refunds.
A retry policy converts transient errors into eventual success; a dead-letter queue converts permanent errors into ticketed problems instead of stuck pipelines. The two engineering decisions that matter are how to classify an error (retryable vs terminal — never both) and what triggers the move to DLQ (retry-count, age, or per-error-class hard exit). The DLQ is not a graveyard; it is a parked-car lot, and somebody has to be paid to walk through it.
What goes wrong without a DLQ
Without a DLQ, a poison message — one that will never succeed no matter how many times you retry it — has three places it can go, and all three are bad.
It can stay at the head of the queue and block every message behind it. This is what happened to MealRush above: a Kafka consumer with max.poll.records=1 and infinite local retries simply stopped advancing the offset. The partition's lag chart climbed by 800 messages a minute. Other partitions were fine. From the operator's point of view, the consumer was "running" — the JVM was up, heartbeats were green — but nothing was being committed.
It can be silently dropped if the consumer catches the exception and acknowledges anyway. This looks fine in metrics (no errors, full throughput) and is catastrophic in reality. PaySetu had a payouts consumer that wrapped its handler in try/except: pass for a fortnight before anyone noticed the success rate had quietly dropped from 99.97% to 96.4% — the missing 3.6% was real-money refunds being acked-and-discarded.
It can be retried forever at the consumer until the broker's redelivery counter wraps or the message ages out. RabbitMQ without a TTL or a delivery limit (x-delivery-limit on quorum queues) will redeliver the same KeyError-causing message 4 million times in a day, generating 4 million error-log lines, 4 million metric points, and 4 million pages to the on-call.
The DLQ is a fourth, deliberate, place: a separate destination for messages the system has given up on, where they wait in a cold-storage state until a human or a tool decides what to do with them.
Classifying errors — the only decision that matters
Every retry policy is a classifier. The classifier takes an exception or a status code and returns one of three verdicts: retryable, terminal, or — the dangerous third option — don't know. The third verdict is what kills you, so a good classifier never returns it.
A retryable error is one where the same input on the same code path could plausibly succeed on the next attempt. Network timeouts, 503s from a downstream, ConnectionResetError, a brief database deadlock, an HTTP 429 with a Retry-After header — these all qualify. The world is the same; you just need to wait and try again. Why: the operation has not been attempted-and-rejected on its merits — it has been interrupted by infrastructure, so re-attempting can change the outcome without changing the input.
A terminal error is one where retrying is guaranteed to fail and dangerous to attempt. A ValidationError because the payload is missing a required field. A KeyError because the PIN-code lookup table doesn't have an entry. An HTTP 400 from a downstream API. A foreign-key violation because the referenced order doesn't exist. Why: the input is malformed or refers to state that does not exist; the next attempt with the same input will fail with the same error, and retrying just wastes consumer time and downstream capacity.
The dangerous third class — "don't know" — usually shows up as a generic Exception or a 500 from a downstream that didn't include a structured error code. The temptation is to treat unknown errors as retryable ("better safe than sorry"). The production reality is that unknown errors are usually bugs, and bugs are usually deterministic, so retrying them just amplifies the cost. The right default for unknown errors is terminal — to DLQ on first occurrence, where a human reviews them and either fixes the bug or reclassifies the error as retryable in the next deploy. Why: optimising for the unknown case the same way you optimise for the known-transient case is what produces retry storms — every bug becomes a multiplier on consumer load and downstream pressure, and the bug's impact is hidden behind a retry-success metric instead of being visible in a DLQ.
A working retry-and-DLQ classifier in Python
Here is a runnable implementation of the three pieces — a classifier, an exponential-backoff retry, and a DLQ destination — that you can drop into any consumer.
import random, time, logging
from dataclasses import dataclass, field
from typing import Callable, Any

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger("mealrush.consumer")

class TransientError(Exception): pass
class TerminalError(Exception): pass

def classify(exc: Exception) -> str:
    # Known-transient — network, downstream throttling, deadlocks
    if isinstance(exc, (TimeoutError, ConnectionResetError, BlockingIOError)):
        return "transient"
    msg = str(exc).lower()
    if "503" in msg or "429" in msg or "deadlock" in msg or "connection reset" in msg:
        return "transient"
    # Known-terminal — bad input, missing reference, validation
    if isinstance(exc, (KeyError, ValueError, TypeError)):
        return "terminal"
    if "400" in msg or "404" in msg or "validation" in msg or "foreign key" in msg:
        return "terminal"
    # Unknown — default to terminal so bugs surface in the DLQ, not as retry storms
    return "terminal"

@dataclass
class Message:
    id: str
    payload: dict
    attempts: int = 0
    history: list = field(default_factory=list)

def with_retry_then_dlq(handler: Callable[[Message], Any], msg: Message,
                        max_attempts: int = 5, base_ms: int = 200,
                        cap_ms: int = 30_000, dlq=None):
    while msg.attempts < max_attempts:
        msg.attempts += 1
        try:
            handler(msg)
            log.info(f"OK id={msg.id} attempt={msg.attempts}")
            return "success"
        except Exception as e:
            verdict = classify(e)
            msg.history.append({"attempt": msg.attempts, "err": type(e).__name__,
                                "msg": str(e)[:80], "class": verdict})
            if verdict == "terminal":
                log.warning(f"DLQ id={msg.id} terminal err={type(e).__name__}")
                dlq.append(msg)
                return "dlq_terminal"
            # transient — exponential backoff with full jitter (AWS Architecture Blog)
            sleep_ms = min(cap_ms, base_ms * (2 ** (msg.attempts - 1)))
            sleep_ms = random.uniform(0, sleep_ms)
            log.info(f"RETRY id={msg.id} attempt={msg.attempts} "
                     f"err={type(e).__name__} sleep_ms={sleep_ms:.0f}")
            time.sleep(sleep_ms / 1000)
    log.warning(f"DLQ id={msg.id} max-attempts-exhausted")
    dlq.append(msg)
    return "dlq_exhausted"

# Demo: one terminal poison, one transient that recovers, one healthy.
def flaky_handler(m: Message):
    if m.payload.get("pin") == "BAD":           # terminal
        raise KeyError(f"pin {m.payload['pin']} not in tax table")
    if m.payload.get("trip", 0) >= m.attempts:  # transient for the first N attempts
        raise TimeoutError("downstream tax-svc 503")
    return "priced"

dlq = []
for m in [Message("o-1", {"pin": "BAD"}),
          Message("o-2", {"trip": 2, "pin": "560001"}),
          Message("o-3", {"trip": 0, "pin": "110001"})]:
    with_retry_then_dlq(flaky_handler, m, dlq=dlq)
print("DLQ size:", len(dlq), "ids:", [m.id for m in dlq])
Sample run:
2026-04-28 12:01:02 DLQ id=o-1 terminal err=KeyError
2026-04-28 12:01:02 RETRY id=o-2 attempt=1 err=TimeoutError sleep_ms=132
2026-04-28 12:01:02 RETRY id=o-2 attempt=2 err=TimeoutError sleep_ms=288
2026-04-28 12:01:03 OK id=o-2 attempt=3
2026-04-28 12:01:03 OK id=o-3 attempt=1
DLQ size: 1 ids: ['o-1']
The shape of the run matters more than the exact timings. Line 1 — o-1 lands in the DLQ after a single attempt because KeyError is classified terminal. The system did not retry a malformed message four extra times to discover it was still malformed; it failed fast. Why: retrying a KeyError 5 times costs five attempts of consumer CPU and downstream load to learn the same fact you knew on attempt 1. Lines 2–4 — o-2 retries twice with growing jittered sleeps and then succeeds on attempt 3. The full-jitter formula (uniform(0, base * 2^(n-1))) is from the AWS Architecture Blog "Exponential Backoff and Jitter" and is the right default — equal-jitter and decorrelated-jitter exist but full-jitter has the simplest tail behaviour. Line 5 — o-3 succeeds on attempt 1, so the loop costs nothing.
Backoff strategies — why jitter is non-negotiable
A retry policy without jitter is a synchronisation device. When 12,000 consumers all see the same downstream blip and all retry at exactly t+1s, t+3s, t+7s, you have built a thundering herd that turns a 300 ms downstream hiccup into a 90-second cascading failure. Jitter — randomising the sleep time within a bounded range — desynchronises retries and spreads the load.
The four strategies to know:
- Fixed delay: sleep k seconds between retries. Wrong by default. Synchronises every consumer to the same retry wave.
- Exponential backoff: sleep base * 2^(n-1). Better, but still synchronises if every consumer started failing at the same moment.
- Exponential + full jitter: sleep uniform(0, base * 2^(n-1)). The default. Spreads retries across the entire backoff window.
- Decorrelated jitter: sleep uniform(base, prev_sleep * 3). Slightly better tail behaviour for very long-tailed downstreams. AWS uses this in some of its SDKs. Both jitter formulas are sketched in code below.
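Both formulas fit in a few lines. A minimal sketch, reusing the 200 ms base and 30-second cap from the earlier example; the helper names are mine, not from any library:

import random

def full_jitter(attempt: int, base_s: float = 0.2, cap_s: float = 30.0) -> float:
    # uniform(0, min(cap, base * 2^(n-1))): the AWS full-jitter formula
    return random.uniform(0, min(cap_s, base_s * (2 ** (attempt - 1))))

def decorrelated_jitter(prev_sleep_s: float, base_s: float = 0.2, cap_s: float = 30.0) -> float:
    # uniform(base, prev_sleep * 3), capped; "decorrelated" because each sleep
    # depends on the previous sleep rather than on the attempt number
    return min(cap_s, random.uniform(base_s, prev_sleep_s * 3))

# Example: the first five sleeps each strategy might produce
sleep = 0.2
for attempt in range(1, 6):
    fj = full_jitter(attempt)
    sleep = decorrelated_jitter(sleep)
    print(f"attempt {attempt}: full_jitter={fj:.2f}s decorrelated={sleep:.2f}s")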
A useful rule: the cap matters as much as the base. If your backoff goes 0.2s → 0.4s → 0.8s → 1.6s → 3.2s → 6.4s → 12.8s → 25.6s → 51.2s, by attempt 9 you are sleeping just under a minute between tries. If you do not cap that at, say, 30 seconds, the backoff itself can leave messages stuck for tens of minutes after the downstream has actually recovered. PaySetu's payments-status consumer once spent 27 minutes in backoff after a 90-second downstream blip because the cap was 600 seconds and the sleeps had already escalated far past the length of the blip itself.
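To put numbers on it, here is the worst-case (no-jitter) wait across an illustrative 13-attempt budget, capped versus uncapped; the 0.2-second base and 30-second cap are the same values used above:

base_s, cap_s = 0.2, 30.0
uncapped = [base_s * 2 ** (n - 1) for n in range(1, 14)]
capped = [min(cap_s, s) for s in uncapped]
print(f"uncapped total wait: {sum(uncapped):.0f}s")  # ~1638s, about 27 minutes
print(f"capped total wait:   {sum(capped):.0f}s")    # ~201s, under 4 minutes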
Where the DLQ actually lives — by broker
The DLQ is not a magic facility — it is just another queue or topic that you write to when you give up. How you wire it depends on the broker.
Kafka: there is no first-class DLQ. The convention is to create a sibling topic named <source>.dlq (or .deadletter) with the same number of partitions. The consumer, on terminal error or after retry-count exhaustion, produces the message to the DLQ topic with a header set containing the original topic, partition, offset, error class, error message, and stack-trace pointer. Tools like Kafka Connect's errors.deadletterqueue.topic.name automate this for sink connectors.
RabbitMQ: native support via the x-dead-letter-exchange queue argument. When a message is rejected with requeue=false, exceeds its TTL, or is dropped because the queue is full, RabbitMQ routes it to the configured exchange. You bind a DLQ to that exchange. Why: RabbitMQ pushes the DLQ semantics into the broker because its consumer model is per-message ack/nack — the broker knows when a message has been definitively rejected, so it can take action without consumer cooperation.
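A minimal pika sketch of that wiring; the queue and exchange names (orders, orders.dlx, orders.dlq) and the process helper are hypothetical:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# The dead-letter exchange and the DLQ bound to it.
ch.exchange_declare(exchange="orders.dlx", exchange_type="fanout")
ch.queue_declare(queue="orders.dlq", durable=True)
ch.queue_bind(queue="orders.dlq", exchange="orders.dlx")

# The main queue: rejected or expired messages are routed to the DLX.
ch.queue_declare(queue="orders", durable=True,
                 arguments={"x-dead-letter-exchange": "orders.dlx"})

def process(body: bytes):
    ...  # your handler (hypothetical)

def on_message(channel, method, properties, body):
    try:
        process(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False is what triggers dead-lettering
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

ch.basic_consume(queue="orders", on_message_callback=on_message)
ch.start_consuming()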
SQS: native, configured via a RedrivePolicy on the queue: {"maxReceiveCount": 5, "deadLetterTargetArn": "..."}. After 5 receives without a delete, SQS moves the message to the DLQ automatically. The "receive count" semantics — incremented on every receive, never decremented — means a poison message that crashes the consumer five times in a row gets moved without any consumer cooperation.
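With boto3 the same policy looks roughly like this, assuming both queues already exist (queue names hypothetical; the RedrivePolicy is passed as a JSON string, and AWS accepts maxReceiveCount as a quoted number):

import json
import boto3

sqs = boto3.client("sqs")
main_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]
dlq_url = sqs.get_queue_url(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]

# After 5 receives without a delete, SQS moves the message to the DLQ.
sqs.set_queue_attributes(
    QueueUrl=main_url,
    Attributes={"RedrivePolicy": json.dumps(
        {"maxReceiveCount": "5", "deadLetterTargetArn": dlq_arn})})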
NATS JetStream: the consumer's MaxDeliver setting caps redeliveries, but JetStream does not move the message anywhere by itself; when the cap is exceeded it publishes a max-deliveries advisory event, and a small helper consumer subscribed to those advisories can fetch the original message by sequence and copy it into a stream you treat as the DLQ. JetStream also exposes AckExplicit so the consumer must ack each message; terminating a message (Term) stops redelivery immediately and emits a similar advisory.
Redis Streams: no native DLQ. Build it the same way as Kafka — a sibling stream events.dlq and a consumer that uses XADD events.dlq followed by XACK events <group> <id> on the original.
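A minimal redis-py sketch of that move, assuming a source stream events, a consumer group pricing, and a DLQ stream events.dlq (all hypothetical names); a Kafka version with richer metadata headers follows.

import redis

r = redis.Redis()

def move_to_dlq(msg_id: bytes, fields: dict, err: Exception):
    # Copy the payload plus failure metadata to the DLQ stream, then ack the
    # original so the consumer group stops redelivering it.
    r.xadd("events.dlq", {**fields,
                          "dlq-error-class": type(err).__name__,
                          "dlq-original-id": msg_id})
    r.xack("events", "pricing", msg_id)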
# Kafka-style: produce to a parallel DLQ topic with metadata headers.
import time
from confluent_kafka import Producer

def to_dlq(orig_msg, err: Exception, attempts: int, dlq_producer: Producer):
    dlq_producer.produce(
        topic=f"{orig_msg.topic()}.dlq",
        key=orig_msg.key(),
        value=orig_msg.value(),
        headers=[
            ("dlq-original-topic", orig_msg.topic().encode()),
            ("dlq-original-partition", str(orig_msg.partition()).encode()),
            ("dlq-original-offset", str(orig_msg.offset()).encode()),
            ("dlq-error-class", type(err).__name__.encode()),
            ("dlq-error-msg", str(err)[:200].encode()),
            ("dlq-attempts", str(attempts).encode()),
            ("dlq-failed-at", str(int(time.time())).encode()),
        ])
    dlq_producer.poll(0)
The headers matter more than the payload. The original payload alone is useless for triage — without the error class, the attempt count, and the original offset, the operator inspecting the DLQ cannot tell whether the message hit a deterministic bug, a transient blip that exhausted the retry budget, or a downstream that has since been fixed. Always write the metadata.
Common confusions
- "DLQ is the same as a retry queue" — No. A retry queue holds messages that will be re-attempted automatically after a delay. A DLQ holds messages the system has given up on; they sit there until a human or a tool intervenes. Conflating them produces infinite-loop pipelines where messages bounce between the DLQ and the main queue forever.
- "Maximum retries should be high so we don't lose data" — High retry counts hide bugs. A
KeyErroron a missing PIN code does not become aValueErroron attempt 12. Five attempts is plenty for transient errors; one attempt is enough for terminal ones. Data is not lost in the DLQ — it is parked, with full provenance. - "Retries with backoff are safe" — Retries amplify load on the downstream that just told you it was overloaded. An open circuit breaker (see /wiki/circuit-breaker-implementation) is often the right answer; retry alone can turn a recoverable downstream into an unrecoverable one.
- "DLQ is for the consumer to ignore" — A DLQ no human reads is a graveyard. Treat the DLQ depth as a first-class SLO with a paging threshold. MealRush pages the on-call when any DLQ exceeds 100 messages, because by 1,000 the engineer has already lost the context to triage them.
- "Idempotency makes retries safe" — Idempotency makes retries correct (you don't double-process). It does not make them cheap — every retry still costs CPU, network, and downstream capacity. See /wiki/at-least-once-idempotency-in-practice.
- "Re-driving from DLQ is just
mvback to the main queue" — Almost never. The bug that put the message in the DLQ is usually still there until a deploy ships, so re-driving without a code fix just re-DLQs the same messages. Re-drive policies should be: deploy the fix, then re-drive, with rate limiting.
Going deeper
Retry budgets and how Envoy / Istio enforce them
A naive consumer-side retry policy says "retry up to 3 times". A retry budget says "retries must be at most 20% of the total request volume to this upstream". Envoy implements this via the retry_budget circuit-breaker setting: the proxy caps concurrent retries to an upstream at a configurable percentage of active requests (budget_percent, 20% by default, with a min_retry_concurrency floor) and refuses further retries once the budget is spent. Why: any per-request retry policy fails under load — when 90% of requests start failing, "retry 3 times" multiplies load by 4×, exactly when the system can least afford it. A budget rate-limits retries as a fraction of total traffic, so retries shrink as the failure rate grows. The 20% default in many service meshes is empirical: low enough to prevent retry-amplification disasters, high enough to recover from typical transient blips. KapitalKite's order-routing layer uses Envoy retry budgets and saw their cascade-failure incidents drop from "monthly" to "have not had one in 14 months".
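A client-side sketch of the same idea: a token bucket where ordinary requests earn fractional retry tokens and each retry spends a whole one. The numbers are illustrative, and this is a simplification rather than Envoy's or gRPC's exact algorithm:

class RetryBudget:
    """Allow retries only while they stay near `ratio` of recent request volume."""

    def __init__(self, max_tokens: float = 100.0, ratio: float = 0.2):
        self.max_tokens = max_tokens
        self.ratio = ratio          # each original request earns this many tokens
        self.tokens = max_tokens

    def record_request(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # a retry spends one whole token
            return True
        return False                # budget exhausted: fail fast instead of retrying

When 90% of traffic is failing, the bucket drains within a few hundred requests and the consumer stops adding retry load, which is exactly the behaviour the budget is meant to force.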
Poison-pill detection and isolation
Some payloads are so badly malformed that the consumer crashes before it can even classify the error and write to the DLQ. The classic case: a serialiser bug where ProtobufDecoder.parse(bytes) segfaults the JVM on a specific 3-byte prefix. The consumer restarts, picks up the same offset, crashes again — a crash loop, not a retry loop. Detection: track the count of crashes per consumer-id within a 60-second window. After 3 crashes on the same offset, the orchestrator (Kubernetes, ECS) should mark that offset as poisoned and either auto-skip with a metric event, or page a human. Real-world pattern from Discord's Elixir consumers: a per-message try that catches even :exit signals, with a fallback that writes to a separate crash.dlq partition with a special header indicating "consumer did not survive parse". This separates "couldn't process" from "couldn't parse".
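One way to sketch the crash-loop guard: persist a tiny marker (offset plus crash count) right before parsing, so that even a hard crash leaves evidence of which offset killed the process. The marker path, the crash limit, and the helper names here are all hypothetical:

import json, os

MARKER = "/var/run/consumer/poison-marker.json"  # hypothetical location
CRASH_LIMIT = 3

def load_marker() -> dict:
    if os.path.exists(MARKER):
        with open(MARKER) as f:
            return json.load(f)
    return {"offset": None, "crashes": 0}

def should_skip(offset: int) -> bool:
    """Before processing, check whether this offset has already crashed us."""
    m = load_marker()
    return m["offset"] == offset and m["crashes"] >= CRASH_LIMIT

def record_attempt(offset: int) -> None:
    """Write the marker before handing the message to the parser, so a hard
    crash still leaves a record of which offset was being processed."""
    m = load_marker()
    crashes = m["crashes"] + 1 if m["offset"] == offset else 1
    with open(MARKER, "w") as f:
        json.dump({"offset": offset, "crashes": crashes}, f)

def record_success() -> None:
    if os.path.exists(MARKER):
        os.remove(MARKER)

The consumer calls should_skip before each message; once the limit is hit it writes the raw bytes to the crash DLQ (or pages a human) instead of parsing again.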
Re-drive automation — the slow ramp
Re-driving from a DLQ is almost always rate-limited. If 41,000 messages went to the DLQ during a 90-minute downstream outage, dumping them all back into the main queue at full speed will simply overwhelm the now-recovered downstream. The pattern: a re-drive script that reads from the DLQ at a configurable QPS (often 10–20% of normal traffic), back-pressures on the consumer's lag, and stops if the DLQ refill rate exceeds the drain rate (which means the bug is not actually fixed). MealRush's re-drive script is 80 lines of Python that consumes from *.dlq, applies a token-bucket of 50 messages/sec, produces back to the original topic, and aborts if the same dlq-error-class reappears within 30 seconds.
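A condensed sketch of such a script with confluent-kafka, assuming the DLQ messages carry the dlq-failed-at header written earlier; the topic names and the 50-per-second pace are illustrative, and the abort check is a simplification: it stops as soon as it sees a DLQ entry written after the re-drive started, a sign the bug is still live.

import time
from confluent_kafka import Consumer, Producer

RATE = 50  # messages per second fed back into the main topic

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "redrive",
                     "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.dlq"])

started = time.time()
sent = 0
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    headers = dict(msg.headers() or [])
    failed_at = int(headers.get("dlq-failed-at", b"0") or b"0")
    if failed_at > started:
        # A message entered the DLQ *after* the re-drive began: the bug is not
        # fixed, and continuing would just loop messages through the DLQ forever.
        raise SystemExit("aborting re-drive: DLQ is refilling")
    producer.produce(topic="orders", key=msg.key(), value=msg.value())
    producer.poll(0)
    # Pacing: never run ahead of RATE messages/sec since the re-drive started.
    sent += 1
    sleep_for = started + sent / RATE - time.time()
    if sleep_for > 0:
        time.sleep(sleep_for)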
Why "retry once and DLQ" is usually right for terminal errors
The temptation is symmetric — retry transient and terminal errors with the same max_attempts=5, on the theory that classification might be wrong. The right asymmetry is: retryable errors get the full 5 attempts with backoff; terminal errors go to the DLQ on the first occurrence. Why: if your classification is correct, terminal errors waste 4 extra retries; if your classification is wrong, you'll discover that in the DLQ within minutes (because the DLQ is monitored), and the cost of fixing the classifier is one deploy. The cost of running the classifier wrong with max_attempts=5 is 5× the consumer load and 5× the downstream pressure for every bug, hidden behind a "retry-success" curtain.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install confluent-kafka   # only needed for the Kafka to_dlq snippet; retry_dlq.py is stdlib-only
# Save the with_retry_then_dlq snippet as retry_dlq.py and run:
python3 retry_dlq.py
# Expected: 1 terminal DLQ, 1 transient recovered, 1 healthy.
Where this leads next
DLQs are the catch-all on the at-least-once + idempotency pipeline; the chapters that surround this one fill in adjacent pieces:
- /wiki/at-least-once-idempotency-in-practice — why every retry needs an idempotency key, and where that key comes from.
- /wiki/exactly-once-and-the-semantics-debate — the marketing-vs-mechanics gap that DLQs make visible.
- /wiki/circuit-breaker-implementation — the cousin pattern: when retrying is wrong even for transient errors because the downstream needs breathing room.
- /wiki/saga-pattern-compensating-actions — when a DLQ'd message represents a half-completed business transaction, compensation, not retry, is the answer.
The discipline that ties them together: every message in your system has a final destination. Either it succeeded, or it is parked in a DLQ awaiting human review, or it has been compensated. Anything else is a leak — and leaks in messaging pipelines are how money disappears six months later in a quarterly audit.
References
- AWS Architecture Blog — Exponential Backoff and Jitter — the canonical post on full-jitter and decorrelated-jitter.
- Kafka Connect — error handling and DLQs — the errors.deadletterqueue.topic.name configuration and what fields are populated.
- AWS SQS — Dead-letter queues — RedrivePolicy, maxReceiveCount, and re-drive APIs.
- RabbitMQ — Dead Letter Exchanges — the x-dead-letter-exchange argument and triggers (reject, TTL, queue length).
- Envoy Proxy — Retry budgets — proxy-level retry-rate limiting.
- Tyler Treat, "You Cannot Have Exactly-Once Delivery" (2015) — the foundational blog post on what DLQs are actually catching.
- /wiki/at-least-once-idempotency-in-practice — the upstream chapter on why retries happen at all.
- Marc Brooker, "Timeouts, retries, and backoff with jitter" (Amazon Builders' Library) — production-tested defaults for the parameters in this chapter.