The executor: local, Celery, Kubernetes

Tuesday, 02:14 IST. Aditi's phone buzzed: the Razorpay settlement DAG had 47 tasks queued and 0 running. The scheduler was healthy. The metadata DB was healthy. The Celery workers had been silently OOM-killed 90 minutes earlier because one rogue transform_orders task allocated a 14 GB DataFrame, and the supervisor had not respawned them. Every other DAG in the cluster was now stuck behind dead workers. The fix took 3 minutes — kubectl rollout restart deployment/airflow-celery-worker — but the question that kept Aditi awake was structural: why does a single bad task take down the entire worker pool? The answer lives in the executor, the layer between "the scheduler decided this should run" and "this Python code actually runs somewhere".

The executor decides where a scheduled task actually executes — in the scheduler's process (LocalExecutor), on a long-running worker pool (CeleryExecutor), or in a fresh container per task (KubernetesExecutor). Each choice picks a different point on the trade-off triangle of startup latency, resource isolation, and operational simplicity, and the right pick depends on the shape of your workload, not the framework's marketing.

What the executor actually does

The scheduler decides what should run and when. The executor decides where and how. A scheduled task instance moves through three states the executor owns: queued (the scheduler has marked it ready), running (the executor has handed it to compute), and done (the executor has reported the exit status back). Every framework — Airflow, Dagster, Prefect, your own DAG executor from chapter 23 — has this layer, even when it isn't named explicitly.

Figure: Where the executor sits between scheduler and compute. A four-layer diagram: at the top, the scheduler picks ready tasks from DAGs and writes task instances to a queue; the executor picks each task off the queue and places it on compute — a subprocess inside the scheduler (LocalExecutor, no network hop), a long-running worker pool behind a Redis/RabbitMQ broker (CeleryExecutor), or a fresh pod per task with a per-team image possible (KubernetesExecutor); at the bottom, the compute layer runs the user code and reports the exit status back up. Scheduler decides what; executor decides where.
The executor is a placement layer. The scheduler does not care whether a task ran on a process, a worker, or a pod — it only cares that the task reached a terminal state. The executor's choices about where tasks run determine startup latency, isolation, and the failure modes you have to debug.

Three properties define an executor: placement (where does the user code run — same process, remote worker, fresh container?), lifecycle (how long does the runtime live — task duration, days, forever?), and isolation (what does one task share with another — memory, dependencies, OS user, network, nothing?). Every executor is a different choice across those three axes, and the same DAG can behave very differently depending on which executor you put underneath it.
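Those three axes can be written down as data — a sketch, with illustrative wording in each cell, that previews the three executors this chapter covers:

```python
# The three executor axes — placement, lifecycle, isolation — as data.
# The wording in each cell is illustrative, not a real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutorProfile:
    name: str
    placement: str   # where the user code runs
    lifecycle: str   # how long the runtime lives
    isolation: str   # what one task shares with another

PROFILES = [
    ExecutorProfile("LocalExecutor", "subprocess on the scheduler box",
                    "task duration", "shares host memory, deps, OS user"),
    ExecutorProfile("CeleryExecutor", "long-running remote worker",
                    "days to weeks", "shares the worker image and deps"),
    ExecutorProfile("KubernetesExecutor", "fresh pod per task",
                    "task duration", "shares nothing by default"),
]

for p in PROFILES:
    print(f"{p.name:20s} | {p.placement:33s} | {p.lifecycle:14s} | {p.isolation}")
```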

The framework you use barely matters for this layer. Airflow's LocalExecutor, Celery, and Kubernetes executors are the canonical names; Dagster has MultiprocessExecutor, CeleryExecutor, K8sRunLauncher; Prefect has Process workers, Docker workers, Kubernetes workers. The names differ; the trade-offs are identical. Why this generalises across frameworks: the executor is solving a placement problem, and placement problems have a fixed catalog of solutions — share a process, pool long-running workers, or spin up an isolated container per task. Every workflow tool eventually offers all three because they cover the workload spectrum from cheap-and-fast to heavy-and-isolated.

LocalExecutor — the same Python process

LocalExecutor is the simplest thing that works. The scheduler forks a subprocess for each task, the subprocess runs the user code, the exit status comes back via the OS, and the scheduler updates the metadata DB. There is no broker, no worker pool, no remote call.

# A 60-line LocalExecutor — the entire mechanism
import subprocess, multiprocessing, queue, time, os, signal, sys
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskInstance:
    task_id: str
    dag_id: str
    command: list[str]                    # e.g. ["python", "-m", "tasks.extract"]
    state: str = "queued"
    pid: Optional[int] = None
    exit_code: Optional[int] = None

class LocalExecutor:
    def __init__(self, parallelism: int = 4):
        self.parallelism = parallelism
        self.queue: queue.Queue = queue.Queue()
        self.running: dict[str, subprocess.Popen] = {}

    def queue_task(self, ti: TaskInstance) -> None:
        self.queue.put(ti)

    def heartbeat(self) -> list[TaskInstance]:
        """Called by the scheduler every ~5 seconds. Reaps finished tasks
        and starts new ones up to parallelism."""
        finished = []
        # Reap completed processes
        for tid, proc in list(self.running.items()):
            if proc.poll() is not None:                 # exited
                ti = proc._ti
                ti.state = "success" if proc.returncode == 0 else "failed"
                ti.exit_code = proc.returncode
                finished.append(ti)
                del self.running[tid]
        # Start new tasks up to parallelism
        while len(self.running) < self.parallelism and not self.queue.empty():
            ti = self.queue.get()
            proc = subprocess.Popen(ti.command, env={**os.environ, "AIRFLOW_TASK": ti.task_id})
            proc._ti = ti                               # smuggle for reaping
            ti.pid = proc.pid
            ti.state = "running"
            self.running[ti.task_id] = proc
        return finished

if __name__ == "__main__":
    ex = LocalExecutor(parallelism=2)
    for i in range(4):
        ex.queue_task(TaskInstance(f"t{i}", "demo", ["sleep", "2"]))
    while ex.running or not ex.queue.empty():
        for ti in ex.heartbeat():
            print(f"{ti.task_id} -> {ti.state} (rc={ti.exit_code})")
        time.sleep(0.5)
# Sample run:
$ python local_executor.py
t0 -> success (rc=0)
t1 -> success (rc=0)
t2 -> success (rc=0)
t3 -> success (rc=0)
$ # Wall time: 4.1s for 4 tasks at parallelism=2

The mechanism is genuinely this small. subprocess.Popen(ti.command) spawns a new OS process — fast on Linux (a few milliseconds for fork+exec), and the parent retains a Popen handle to query exit status. proc.poll() returns None while the child is alive and the exit code once it has exited; this is non-blocking and is what makes the heartbeat efficient. proc._ti = ti is the executor's only piece of bookkeeping — when the OS reaps a child, the executor needs to know which task instance it corresponds to. parallelism=4 caps concurrent processes; this is also typically the number of cores you give the scheduler box. Why parallelism is bounded by cores, not RAM, for LocalExecutor: most data tasks are CPU-bound (pandas, numpy, parquet read/write), and oversubscribing cores costs more in context-switch overhead than it gains in concurrency. Memory is bounded by the host — there is no per-task memory limit because every task shares the scheduler's box.
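The non-blocking poll() pattern is worth seeing in isolation — a three-line demo (assumes a Unix sleep binary is on PATH):

```python
# poll() never blocks: it returns None while the child runs,
# and the exit code once the child has exited.
import subprocess

proc = subprocess.Popen(["sleep", "1"])
print(proc.poll())     # None — still running; the call returns immediately
proc.wait()            # block until exit (the heartbeat never does this)
print(proc.poll())     # 0 — the exit code is now available
```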

LocalExecutor's strengths are operational. There is no broker to keep healthy, no worker pool to autoscale, no Kubernetes cluster to admin. A single VM running the scheduler runs the executor and the workers; deploys are git pull && systemctl restart airflow-scheduler. The Razorpay data team's first 18 months ran on LocalExecutor with a single 16-core box, processing 4,000 task instances per day. You leave LocalExecutor when one of three things breaks. (i) A single rogue task with a 30 GB pandas DataFrame OOM-kills the scheduler box; everything dies. (ii) Total daily compute exceeds what one box can do — at 16 cores × 24 hours = 384 core-hours, you can run ~23k single-core minute-long tasks per day, more than enough for most teams but not for a Flipkart-scale catalog. (iii) Different tasks need different Python environments (TensorFlow 2.15 vs 2.18, polars 1.0 vs 0.20), and they cannot coexist in one virtualenv. At that point you graduate.
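The core-hour ceiling in (ii) is worth making explicit — back-of-envelope arithmetic for the 16-core box:

```python
# Capacity ceiling of a single LocalExecutor box.
# Assumes tasks are single-core and the box runs flat out 24 hours a day.
cores = 16
core_hours_per_day = cores * 24                    # 384 core-hours
core_minutes_per_day = core_hours_per_day * 60     # 23,040 core-minutes

task_minutes = 1                                   # a minute-long, single-core task
max_tasks_per_day = core_minutes_per_day // task_minutes
print(max_tasks_per_day)                           # 23040 — the hard ceiling
```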

CeleryExecutor — long-running worker pool

CeleryExecutor decouples the scheduler from compute. The scheduler enqueues tasks into a message broker (Redis or RabbitMQ); a separate fleet of long-running worker processes consumes from the broker and runs tasks. The broker handles delivery; the workers handle execution. The scheduler box can be a t3.medium VM doing nothing but scheduling, while the workers run on c5.4xlarge boxes optimised for CPU-bound work.

# CeleryExecutor — the conceptual shape
# In Airflow's airflow.cfg: executor = CeleryExecutor
# In airflow.cfg [celery] section: broker_url = redis://redis:6379/0

from celery import Celery
import subprocess, time

# This is the worker — runs as a separate process, possibly on a separate box.
app = Celery("airflow_worker", broker="redis://redis:6379/0",
             backend="redis://redis:6379/1")

@app.task(bind=True, max_retries=0, acks_late=True)
def execute_task(self, task_id: str, dag_id: str, command: list[str]) -> dict:
    """The worker runs this. It's just subprocess.run, plus retry/heartbeat."""
    print(f"[worker {self.request.hostname}] starting {dag_id}.{task_id}")
    started = time.time()
    proc = subprocess.run(command, capture_output=True, text=True, timeout=3600)
    return {
        "task_id": task_id,
        "dag_id": dag_id,
        "exit_code": proc.returncode,
        "duration_s": time.time() - started,
        "stdout_tail": proc.stdout[-2000:],
        "worker_host": self.request.hostname,
    }

# This is what the executor (running inside the scheduler) does to dispatch.
class CeleryExecutorShape:
    def __init__(self):
        self.app = app
        self.results: dict[str, "AsyncResult"] = {}

    def queue_task(self, ti):
        # Enqueue → broker (Redis) → first idle worker picks it up
        async_result = execute_task.apply_async(
            args=[ti.task_id, ti.dag_id, ti.command],
            queue=ti.queue or "default",
            task_id=f"{ti.dag_id}__{ti.task_id}__{int(time.time())}",
        )
        self.results[ti.task_id] = async_result

    def heartbeat(self):
        finished = []
        for task_id, ar in list(self.results.items()):
            if ar.ready():
                result = ar.get(timeout=0)
                finished.append((task_id, result))
                del self.results[task_id]
        return finished
# Worker output (running on box airflow-worker-3):
$ celery -A airflow_worker worker --concurrency=4 --queues=default,heavy
[2026-04-25 02:14:01] celery@airflow-worker-3 ready.
[2026-04-25 02:14:03] [worker airflow-worker-3] starting upi_daily.extract
[2026-04-25 02:14:09] [worker airflow-worker-3] starting upi_daily.transform
[2026-04-25 02:14:14] task upi_daily.extract succeeded in 6.1s
[2026-04-25 02:14:22] task upi_daily.transform succeeded in 13.0s

# Broker view (Redis MONITOR):
LPUSH celery {"task": "execute_task", "args": ["extract", "upi_daily", ...]}
BRPOP celery 0     # worker 3 pops
SET celery-task-meta-upi_daily__extract__1745540041 {"status": "SUCCESS", "result": {"exit_code": 0, ...}}

The interesting changes from LocalExecutor live in three places. apply_async(...) is a non-blocking publish to the broker; the executor returns instantly with an AsyncResult handle and never knows or cares which worker picked the task up. acks_late=True is critical: a worker that picks up a task but crashes before completing it will have the task re-delivered to another worker, because Celery only acknowledges the broker after the task body finishes. Without acks_late, a worker crash silently drops the task. queue=ti.queue or "default" lets you route specific tasks to specific worker pools — a worker started with --queues=heavy only picks up tasks tagged heavy, which is how you isolate memory-hungry tasks (the Razorpay incident at the top of this article would have been contained if transform_orders had been routed to a heavy queue with its own pool).
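Queue routing is worth sketching on its own. This is the resolution logic that Celery's task_routes / task_default_queue settings implement, written without Celery itself; the task and queue names are illustrative:

```python
# Queue routing, reduced to its core. In Celery config this would be:
#   app.conf.task_routes = {"tasks.transform_orders": {"queue": "heavy"}}
#   app.conf.task_default_queue = "default"
# Workers then opt in per pool:
#   celery -A airflow_worker worker --queues=default --concurrency=8
#   celery -A airflow_worker worker --queues=heavy   --concurrency=1
TASK_ROUTES = {
    "tasks.transform_orders": "heavy",   # the known memory hog gets its own pool
}
DEFAULT_QUEUE = "default"

def resolve_queue(task_name: str) -> str:
    """Route a task to its queue; unknown tasks fall through to the default."""
    return TASK_ROUTES.get(task_name, DEFAULT_QUEUE)

print(resolve_queue("tasks.transform_orders"))   # heavy
print(resolve_queue("tasks.extract"))            # default
```

With this in place, a rogue transform_orders can only exhaust the single-worker heavy pool; the default pool keeps draining every other DAG.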

Celery's strengths are scale and decoupling. Workers can live in a different VPC from the scheduler, you can have 100 of them, you can autoscale them based on queue depth, you can deploy them independently. The cost is operational surface. Why Celery is a meaningful operational burden: you now have three things to keep healthy — the scheduler, the broker (Redis or RabbitMQ), the workers — and any one of them being unhealthy stops your pipelines. Brokers fill up if workers are slow; workers leak memory if Python tasks misbehave; the scheduler thinks tasks are running when in fact the worker died. Most "Airflow stopped working" pages on a Celery deployment are actually broker or worker pages, not scheduler ones.

The other Celery pitfall is shared dependencies. All workers in a pool run the same image with the same Python version and the same installed packages. A team that wants TensorFlow 2.18 and a team that wants TensorFlow 2.15 cannot share a Celery pool — they need either separate pools or a different executor. This is the single most common reason teams graduate from Celery to Kubernetes.

KubernetesExecutor — a fresh pod per task

KubernetesExecutor pushes isolation to the extreme: each task gets its own Kubernetes pod, with its own container image, resource limits, network policies, and lifecycle. When the task finishes, the pod is destroyed. The scheduler talks to the Kubernetes API; the executor's job is to translate "run this task" into "kubectl apply this Pod manifest" and then watch the pod to completion.

# KubernetesExecutor — what the scheduler does to launch a task
from kubernetes import client, config, watch
import yaml, time

config.load_incluster_config()       # we run inside the cluster
api = client.CoreV1Api()

def launch_task_pod(ti) -> str:
    pod_manifest = {
        "apiVersion": "v1", "kind": "Pod",
        "metadata": {
            "name": f"task-{ti.dag_id}-{ti.task_id}-{int(time.time())}".lower(),
            "namespace": "airflow-tasks",
            "labels": {"airflow_dag": ti.dag_id, "airflow_task": ti.task_id},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "task",
                "image": ti.image or "razorpay/data-platform:2026.04",
                "command": ti.command,
                "resources": {
                    "requests": {"cpu": ti.cpu_request or "500m",
                                 "memory": ti.mem_request or "1Gi"},
                    "limits":   {"cpu": ti.cpu_limit   or "2",
                                 "memory": ti.mem_limit   or "4Gi"},
                },
                "env": [{"name": "AIRFLOW_TASK", "value": ti.task_id}],
            }],
            "serviceAccountName": "airflow-task",
            "nodeSelector": ti.node_selector or {},
        },
    }
    pod = api.create_namespaced_pod(namespace="airflow-tasks", body=pod_manifest)
    return pod.metadata.name

def watch_pod(pod_name: str) -> int:
    """Return the container exit code once the pod terminates."""
    w = watch.Watch()
    for ev in w.stream(api.list_namespaced_pod, namespace="airflow-tasks",
                       field_selector=f"metadata.name={pod_name}", timeout_seconds=3600):
        phase = ev["object"].status.phase
        if phase in ("Succeeded", "Failed"):
            cs = ev["object"].status.container_statuses[0]
            return cs.state.terminated.exit_code
    raise TimeoutError(f"pod {pod_name} did not terminate within 1h")
# Cluster view (kubectl during a DAG run):
$ kubectl -n airflow-tasks get pods
NAME                                          READY   STATUS      RESTARTS   AGE
task-upi-daily-extract-1745540041-x9k         1/1     Running     0          12s
task-upi-daily-transform-1745540052-bq2       0/1     Pending     0          1s
task-credit-risk-features-1745540020-mn4      1/1     Running     0          43s

$ kubectl -n airflow-tasks describe pod task-upi-daily-extract-1745540041-x9k
Containers:
  task:
    Image:        razorpay/data-platform:2026.04-extract
    Command:      [python, -m, tasks.upi_extract]
    Limits:       cpu: 2  memory: 4Gi
    Requests:     cpu: 500m  memory: 1Gi

The mechanism is surprisingly simple — the Kubernetes API does the heavy lifting. pod_manifest is a YAML-equivalent dictionary describing the pod; the executor's job is to fill in the container image, command, and resources from the task's metadata. restartPolicy: Never is critical: we want exactly-one-attempt semantics from the pod's perspective. Retries are owned by the scheduler, not Kubernetes; if the pod crashes, the executor reports failure and the scheduler decides whether to enqueue another attempt. resources.limits is the killer feature — an OOM-kill like the one in the opening story is contained to one pod, and the rest of the cluster is unaffected. image: ti.image lets every task pick its own container image — the ML team's pod can ship a 6 GB CUDA image, the analytics team's pod can ship a 200 MB pandas image, and they share nothing.

Kubernetes' strengths and weaknesses both come from "fresh pod per task". The strength is total isolation: every task has its own filesystem, its own Python, its own resource budget, its own service account. A Kubernetes deployment can span multiple teams, multiple Python versions, multiple framework versions. The weakness is startup latency. Pulling a 4 GB image from a registry can take 20–60 seconds; the pod scheduler has to find a node with capacity; the container has to start. A task that runs for 10 seconds incurs a 30-second startup tax — the executor is now 75% overhead. Why startup latency dominates for short tasks: pod creation is dominated by image pull and pod-scheduler assignment, both of which are roughly fixed costs that don't shrink for small tasks. A pod-per-task model is amortised when individual tasks run for minutes; it is wasted when tasks run for seconds. Teams running thousands of sub-second tasks per day on KubernetesExecutor see 80%+ of compute go to pod startup, not actual work.
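The startup-tax arithmetic made explicit — a sketch assuming a fixed 30-second startup cost, matching the example above:

```python
# Fraction of wall time spent on pod startup rather than useful work.
def overhead_fraction(task_s: float, startup_s: float = 30.0) -> float:
    return startup_s / (task_s + startup_s)

for task_s in (10, 60, 600, 14_400):   # 10 s, 1 min, 10 min, 4 h
    print(f"{task_s:>6} s task -> {overhead_fraction(task_s):.0%} overhead")
```

The curve is the whole argument: a 10-second task is 75% overhead, a 10-minute task is under 5%, and a 4-hour Spark job doesn't notice the pod at all.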

The fix is to mix: Kubernetes pods for some tasks, Celery for the rest — Airflow supports this at the task level, via KubernetesPodOperator inside a Celery deployment or via CeleryKubernetesExecutor's queue-based routing. The pattern that works in 2026: CeleryExecutor as the default for short, homogeneous tasks; per-task pods for long-running heavy tasks (a 4-hour Spark submit, a multi-GPU training job) where startup latency is rounding error.

How each one breaks at 3 a.m.

The honest comparison is in failure modes. The Razorpay incident, the Flipkart Big Billion Days war stories, the Zerodha trading-day outages — every one of them is a story about an executor failing in a specific way.

Failure modes by executor — what goes wrong, and where:

    Failure mode                     Local                      Celery                        Kubernetes
    Rogue task OOM-kills             scheduler box dies         worker dies                   contained to pod
    Dependency conflict              no solution                per-pool image                per-task image
    Startup latency                  ms (fork)                  ms (worker idle)              10–60 s (pod create)
    Broker / control-plane outage    N/A                        all tasks stop                all tasks stop
    Per-task resource limits         none                       cgroup, soft                  hard, k8s-enforced
    Operational surface              1 box                      scheduler+broker+workers      + a k8s cluster
The worst cells in each column are where that executor will eventually page you. LocalExecutor's "scheduler box dies" is the most common reason teams leave it; Kubernetes' "10–60 s startup latency" is the most common reason teams move some tasks back to Celery. Pick the executor whose worst cells you can tolerate operationally.

The "single bad task kills the cluster" story (LocalExecutor). A Bengaluru ed-tech company ran LocalExecutor on a 32-core box with 100+ daily DAGs. One task started returning 80M rows from a SQL query into a pandas DataFrame after a schema change upstream. The task allocated 60 GB on a 64 GB box, the OOM-killer triggered, the kernel sent SIGKILL to whatever was using the most memory — and that was the Airflow scheduler itself. All 47 in-progress tasks were orphaned. The team migrated to CeleryExecutor that quarter; the same bug now kills one Celery worker, the scheduler stays up, the supervisor respawns the worker in 30 seconds.

The "broker is full" story (CeleryExecutor). Flipkart's pre-Big-Billion-Days dry run in 2024 had a Redis broker filling at 12 GB / hour because workers were OOM-killed faster than they could ack tasks. The supervisor respawned them, they picked up the tasks again, OOM-killed again — a respawn loop that piled the broker high until Redis hit its maxmemory and started rejecting writes. The scheduler could no longer enqueue new tasks; the entire pipeline stopped. The fix was a per-queue worker memory limit and OOM detection that quarantines a task instead of redelivering it after 3 attempts.

The "wrong image on Friday afternoon" story (KubernetesExecutor). Zerodha's trading-day pipeline runs the EOD aggregation at 16:00 IST sharp. One Friday in 2025, a deployment merged a new image tag (zerodha/eod:2025.10.18-rc2) that was almost identical to the previous image — but the pyarrow version had jumped from 12.0.0 to 14.0.0 and the Iceberg writer behaviour changed. The pod ran for 4 minutes, wrote 80% of the rows, and crashed on a schema change in the manifest. The pod exited non-zero; the executor reported failure; the scheduler retried — same image, same crash. By 17:30 the on-call had to manually pin the image tag back to :2025.10.04. Lesson: KubernetesExecutor's per-task image is a knife with no guard — every image change is a deploy, and there is no smoke test in front of it.

The "node-pool is full" story (KubernetesExecutor). A Bengaluru fintech autoscaled their EKS node pool from 4 to 40 nodes during the daily 02:00 IST batch. Pods went from Pending to Scheduled in seconds. One Tuesday, the ap-south-1a availability zone had a capacity shortage — the autoscaler tried to add 12 m5.4xlarge nodes and got 4. Pods piled up in Pending. Tasks did not run. The scheduler did not know — the executor said "pod created, watching" — and the SLA-miss callbacks fired 90 minutes later. The fix was multi-AZ node groups and a Pending timeout in the executor that escalates after 5 minutes.

The pattern: each executor has a different layer where the failure happens. LocalExecutor failures are at the OS layer (memory, file descriptors, fork). CeleryExecutor failures are at the broker / worker layer (queue depth, ack semantics, worker respawn loops). KubernetesExecutor failures are at the cluster layer (node capacity, image registry, pod scheduling). Your on-call's expertise has to match the executor you pick. A team that doesn't have a Kubernetes admin should not run KubernetesExecutor; a team that doesn't have a Redis admin should not run CeleryExecutor at scale.

Going deeper

Celery's acks_late and the duplicate-execution risk

acks_late=True means the message is acknowledged to the broker only after the task body finishes — guaranteeing at-least-once delivery. acks_late=False (Celery's default) acks immediately upon receipt — guaranteeing at-most-once. For data tasks that write to a sink, acks_late=True is almost always correct, because losing a task silently is worse than running it twice (assuming the task is idempotent — see chapter 8). The cost is duplicate execution: a worker that completes a task and crashes before acking will have the task redelivered and re-run. This is why every long-running data task needs a deduplication key on its output (chapter 9). The Celery + idempotency pair is what makes the executor safe; either alone is not enough.
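What a dedup key looks like at the sink — a toy sketch with a dict standing in for the real store; in production the key would be a unique constraint or MERGE key in the sink itself:

```python
# Idempotent sink sketch: a stable key derived from the task's identity
# absorbs the duplicate execution that at-least-once delivery can produce.
sink: dict[str, dict] = {}

def write_result(dag_id: str, task_id: str, run_id: str, payload: dict) -> bool:
    """Write once per (dag, task, run); redeliveries become no-ops."""
    key = f"{dag_id}:{task_id}:{run_id}"     # stable across redeliveries
    if key in sink:
        return False                          # duplicate delivery — absorbed
    sink[key] = payload
    return True

print(write_result("upi_daily", "extract", "2026-04-25", {"rows": 812}))  # True
print(write_result("upi_daily", "extract", "2026-04-25", {"rows": 812}))  # False
```

Note the key uses the run_id, not a timestamp: two attempts of the same run must collide, while tomorrow's run must not.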

Kubernetes pod-scheduling and the headroom-vs-utilisation trade-off

KubernetesExecutor's startup latency is dominated by two things: image pull (slow if the image is fat or the registry is slow) and pod scheduling (slow if the cluster is full). Both can be tuned. Image pull is fixed by image pre-pulling — running a DaemonSet that pulls the common base image on every node as soon as the node joins. Pod scheduling is fixed by headroom — keeping 10–20% of cluster capacity always idle so new pods schedule instantly. The trade-off: headroom costs money (you pay for nodes you don't use), but it buys low p99 startup latency. A Bengaluru ML platform pays roughly ₹35 lakh/year extra for 15% headroom on their 800-node cluster, in exchange for p99 pod startup of 4 seconds vs the 90-second p99 they had at 95% utilisation.
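Image pre-pulling can be sketched in the same dict-manifest style as launch_task_pod above — an assumed DaemonSet shape with illustrative names, whose init container pulls the fat base image onto every node as it joins:

```python
# A pre-pull DaemonSet sketch (names and image tags are illustrative).
# The init container's only job is to force the image pull; the tiny pause
# container then keeps the pod resident so the image stays warm in the
# node's cache.
prepull_manifest = {
    "apiVersion": "apps/v1", "kind": "DaemonSet",
    "metadata": {"name": "image-prepull", "namespace": "airflow-tasks"},
    "spec": {
        "selector": {"matchLabels": {"app": "image-prepull"}},
        "template": {
            "metadata": {"labels": {"app": "image-prepull"}},
            "spec": {
                "initContainers": [{
                    "name": "pull",
                    "image": "razorpay/data-platform:2026.04",
                    "command": ["true"],   # exit immediately; the pull is the point
                }],
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",
                }],
            },
        },
    },
}
```

Because a DaemonSet schedules one pod per node, every new node pays the pull cost once at join time instead of on the first task pod's critical path.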

Why Airflow's CeleryKubernetesExecutor exists

Airflow ships an explicit CeleryKubernetesExecutor that routes tasks to either Celery or Kubernetes based on a queue parameter. Tasks tagged queue="kubernetes" go to the Kubernetes side; everything else goes to Celery. This is the production answer to the "fast tasks on Celery, heavy tasks on Kubernetes" pattern, baked into the framework. The mechanism is straightforward: the executor is a thin shim that forwards each task to the underlying executor based on its queue. The interesting consequence is that DAG authors don't have to know which executor they are using — they just tag heavy tasks with queue="kubernetes" (and customise the pod via executor_config if needed) and the platform team handles the routing.
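The shim pattern reduces to a few lines — a sketch with illustrative names, forwarding by queue over two executors that expose the queue_task/heartbeat interface used earlier in this chapter:

```python
# The CeleryKubernetesExecutor pattern, reduced to its core: route on the
# task's queue tag, aggregate completions from both sides.
class HybridExecutor:
    KUBERNETES_QUEUE = "kubernetes"

    def __init__(self, celery_executor, k8s_executor):
        self.celery = celery_executor
        self.k8s = k8s_executor

    def queue_task(self, ti) -> None:
        # DAG authors never see this branch — they only set ti.queue.
        target = self.k8s if ti.queue == self.KUBERNETES_QUEUE else self.celery
        target.queue_task(ti)

    def heartbeat(self) -> list:
        # One heartbeat drives both underlying executors.
        return self.celery.heartbeat() + self.k8s.heartbeat()
```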

Dagster's run-launcher and step-launcher split

Dagster splits the executor into two layers. The run launcher decides where the orchestration process for a DAG run lives — typically in a Kubernetes pod (K8sRunLauncher) or a long-running daemon (DefaultRunLauncher). The step launcher then decides where each individual asset materialisation runs — in the same process, in a Celery worker, or in a separate Kubernetes pod. This is more flexibility than Airflow exposes, and is the reason Dagster scales cleanly to mixed batch + heavy-Spark workloads: the run launcher is lightweight, the step launcher is per-asset, and the two compose orthogonally.

What changes in 2026: serverless executors and the death of long-running workers

By 2026, AWS Step Functions, GCP Workflows, and Modal-style serverless compute have made "fresh function-per-task" cheap enough that some teams skip Kubernetes entirely. The startup latency gap has closed: AWS Lambda cold-start is 200–500ms in 2026, vs 10–60s for a Kubernetes pod. The cost model is also better for spiky workloads — you pay for ms of compute, not for idle worker-hours. The emerging pattern: a small team running ~1000 tasks/day on Modal pays roughly ₹4,000/month total, vs ₹40,000/month for an EKS cluster idle most of the time. The trade-off is portability — if your tasks are aws_lambda.invoke()-shaped, you are locked in. The longer-term direction is for Airflow / Dagster / Prefect to ship "Modal executor" / "Lambda executor" plugins, which are starting to appear; the executor abstraction was always going to outlive any specific compute layer.

Where this leads next

The executor is the layer that makes a workflow framework production-shaped. It is also the layer that gets the least documentation in vendor blog posts, because the trade-offs are operational and don't fit into a feature comparison. The right way to learn the executor is to watch one fail at 3 a.m. — to see a Celery worker pool stall, a Kubernetes pod stay Pending for 8 minutes, a LocalExecutor scheduler box go dark — and learn from each one which axis of the trade-off triangle bit you.

References

  1. Airflow executors documentation — the official catalog: Local, Celery, Kubernetes, CeleryKubernetes, Local-Kubernetes.
  2. Celery: at-least-once and acks_late — the delivery-semantics docs that every Celery deployment needs.
  3. Kubernetes Pod Lifecycle — the canonical reference for pod phases (Pending, Running, Succeeded, Failed) that the executor watches.
  4. Dagster's run launcher / step launcher split — the architecturally clean version of the executor abstraction.
  5. Astronomer's "Choosing an Airflow executor" — the practitioner-facing guide with deployment-shape recommendations.
  6. Airflow CeleryKubernetesExecutor source — read the code; it's a 200-line shim that explains the pattern.
  7. Backfills: re-running history correctly — chapter 24, where executor parallelism interacts with partition re-processing order.
  8. Airflow vs Dagster vs Prefect — chapter 26, the framework-level companion to this chapter.