The DAG as the right abstraction
The previous chapter named cron's three structural flaws. Flaw 1 — dispatch by wall-clock time instead of by upstream completion — was the worst of the three. The fix is not "a smarter cron". It is a different data structure. Once you model a pipeline as a graph whose nodes are tasks and whose edges are "task B depends on task A's output", every dispatch decision becomes a question about graph state — has every parent finished? — and never about a wall clock. That single change rearranges what a scheduler is.
A DAG — directed acyclic graph — is the right abstraction for a pipeline because the dependency relation between tasks is exactly what a graph encodes: directed (B reads A's output, not vice versa), acyclic (a cycle means a task waits on itself), graph (a task may have multiple parents and multiple children). Adopting the DAG is what turns "fire commands at 02:30" into "execute this task when its upstream is done", which is the property cron lacked.
What a DAG is, precisely
A directed acyclic graph is a set of nodes and a set of directed edges between them, with the constraint that the edges contain no cycles. "Directed" means each edge has a direction — the edge A → B is not the same as B → A. "Acyclic" means you cannot start at any node, follow edges, and return to where you started. The two constraints together give the structure its key property: a topological order exists. You can list the nodes in a sequence such that every edge goes from an earlier node to a later one. That sequence is the order in which tasks can run.
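The existence of that order can be demonstrated mechanically. A naive sketch, assuming the node-to-parents mapping this chapter uses for all its examples (not an efficient implementation, just the definition executed):

```python
def topological_order(parents):
    """Return the nodes so that every node appears after all of its parents.
    `parents` maps each node to the list of nodes it depends on."""
    order, placed = [], set()
    remaining = set(parents)
    while remaining:
        # A node can be placed once every parent has already been placed.
        ready = [n for n in remaining if all(p in placed for p in parents[n])]
        if not ready:
            raise ValueError("no topological order exists: the graph has a cycle")
        for n in sorted(ready):  # sorted only to make the output deterministic
            order.append(n)
            placed.add(n)
            remaining.discard(n)
    return order

# A diamond: A feeds B and C, both feed D.
print(topological_order({"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}))
# -> ['A', 'B', 'C', 'D']
```

Run it on a cyclic input and the ValueError fires immediately, which is the "acyclic means an order exists" claim in executable form.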
In a data pipeline, the nodes are tasks (a SQL query, a Python script, a dbt model, a Spark job), and the edges represent data dependency: an edge A → B means "the output of A is the input of B, so B must wait until A succeeds". The acyclicity constraint corresponds to the obvious physical fact that you cannot compute B from A and also compute A from B in the same run — that would be a circular dependency, and there would be no way to start.
The simplest DAG is a chain: A → B → C, three nodes, two edges, a single linear order. The simplest non-trivial DAG is a fan-out: A → B, A → C — task A produces output that feeds two independent downstream tasks that can run in parallel. The simplest DAG with fan-in is the inverse: A → C, B → C — two upstream tasks both feed a single downstream task that must wait for both. Real data pipelines combine fan-out and fan-in repeatedly, producing the diamond, lattice, and tree shapes that dominate production DAG topology.
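These shapes are concrete data, not just diagrams. As a sketch, each one can be written down in the node-to-parents representation used throughout this chapter (node names illustrative):

```python
# Each shape as a parent map: node -> list of parents.
chain   = {"A": [], "B": ["A"], "C": ["B"]}                   # A -> B -> C
fan_out = {"A": [], "B": ["A"], "C": ["A"]}                   # A -> B, A -> C
fan_in  = {"A": [], "B": [], "C": ["A", "B"]}                 # A -> C, B -> C
diamond = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}  # fan-out then fan-in

def edges(parents):
    """Recover the directed edge list (parent, child) from a parent map."""
    return sorted((p, c) for c, ps in parents.items() for p in ps)

print(edges(diamond))  # -> [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
```

The diamond is literally fan-out followed by fan-in, which is why it shows up so often: any "split the work, then recombine" pattern produces one.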
Why "directed acyclic graph" rather than just "graph": directionality is what encodes "B depends on A". Without direction, the relation is symmetric and you cannot tell which task should run first. Acyclicity is what makes a topological order exist. Without it, two tasks could be mutually waiting on each other and the scheduler would deadlock. The two constraints are not arbitrary — they are the minimum machinery to express "this task can start when these specific other tasks have finished, and the structure is sound enough to execute".
Why the DAG closes flaw 1
cron's flaw 1 — dispatch by wall-clock time, not by upstream state — has a one-line fix once you have a DAG: the dispatch decision becomes "every parent of this task is in state SUCCESS". The wall clock disappears from the dispatch logic entirely. The wall clock can still gate the roots of the DAG (extract jobs that pull from external systems on a schedule), but downstream tasks fire on event, not on time.
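In code, the dispatch decision is a single predicate over graph state. A sketch (the task names echo the chapter's running example and are illustrative):

```python
def can_dispatch(task, parents, state):
    """cron asks "is it 02:30?"; a DAG scheduler asks this instead."""
    return (state[task] == "PENDING"
            and all(state[p] == "SUCCESS" for p in parents[task]))

parents = {"orders": [], "attribution": ["orders"]}

state = {"orders": "RUNNING", "attribution": "PENDING"}
print(can_dispatch("attribution", parents, state))  # -> False: orders not done yet

state["orders"] = "SUCCESS"
print(can_dispatch("attribution", parents, state))  # -> True: upstream finished
```

Notice there is no clock anywhere in the predicate. Whether orders finished in 22 minutes or 4 hours changes *when* the predicate flips to True, never *whether* attribution runs against complete data.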
This single rearrangement collapses an entire family of bugs. The Tuesday-after-the-deploy bug from chapter 19 — orders takes 41 minutes instead of 22, attribution fires at 02:30 against stale data, the dashboard is wrong by 09:00 — cannot occur in a DAG-based scheduler. The attribution task does not run until orders has emitted a SUCCESS event. If orders takes 41 minutes, attribution starts at 02:41 + 0 seconds. If orders takes 4 hours, attribution starts at 06:00. If orders fails, attribution does not start at all; the on-call engineer receives a single page about orders, fixes it, reruns, and attribution then runs against correct data. The dashboard is late but right, instead of on-time but wrong.
The graph also makes parallelism free. Two tasks with no path between them can run simultaneously without the operator having to think about it. In a cron stack, parallelising two extract jobs requires explicitly staggering them in the crontab and hoping they don't overload the source. In a DAG stack, the executor sees the two tasks have no edge between them and dispatches both — bounded only by a configured concurrency cap, which the scheduler enforces uniformly across the DAG.
The DAG also makes partial reruns trivial. If aggregate_revenue fails because of a SQL error, the operator fixes the SQL and triggers a rerun just for that task and its descendants. The graph tells the scheduler exactly which downstream tasks to re-execute. In a cron stack, a partial rerun is a manual exercise — the operator has to reason about which scripts read which tables, manually invoke the failed script, manually invoke the loader, and hope nothing else also reads from the table being rebuilt mid-run.
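The "that task and its descendants" set is a plain graph traversal over child edges. A sketch, assuming the task-to-parents mapping this chapter uses (task names illustrative):

```python
def rerun_set(failed, parents):
    """The failed task plus every task reachable downstream of it."""
    # Invert the parent map into a child map once.
    children = {}
    for task, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(task)
    # Walk downstream from the failed task.
    todo, seen = [failed], {failed}
    while todo:
        cur = todo.pop()
        for c in children.get(cur, []):
            if c not in seen:
                seen.add(c)
                todo.append(c)
    return seen

parents = {"extract": [], "clean": ["extract"],
           "aggregate_revenue": ["clean"],
           "load_dashboard": ["aggregate_revenue"]}
print(sorted(rerun_set("aggregate_revenue", parents)))
# -> ['aggregate_revenue', 'load_dashboard']
```

The operator fixes the SQL, the scheduler recomputes this set, and nothing upstream of the failure is touched. In the cron world that same set lives only in someone's head.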
Building a tiny DAG executor
The smallest faithful DAG executor is about 80 lines of Python. The dependency structure is a dict[str, list[str]], a mapping from task name to its parents. The executor maintains a state dict[str, str] mapping each task to one of PENDING | RUNNING | SUCCESS | FAILED | UPSTREAM_FAILED, and dispatches tasks whose parents are all SUCCESS.
```python
# tiny_dag.py — execute a DAG, track state, respect concurrency.
import time, threading
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    name: str
    fn: Callable[[], None]
    parents: List[str] = field(default_factory=list)

class DAG:
    def __init__(self, tasks: List[Task], max_parallel: int = 3):
        self.tasks = {t.name: t for t in tasks}
        self.state = {t.name: "PENDING" for t in tasks}
        self.max_parallel = max_parallel
        self.lock = threading.Lock()

    def ready(self):
        with self.lock:
            return [n for n, s in self.state.items() if s == "PENDING"
                    and all(self.state[p] == "SUCCESS" for p in self.tasks[n].parents)]

    def transition(self, name, new_state):
        with self.lock:
            self.state[name] = new_state
            if new_state == "FAILED":
                # mark all reachable descendants as UPSTREAM_FAILED
                stack = [name]
                while stack:
                    cur = stack.pop()
                    for n, t in self.tasks.items():
                        if cur in t.parents and self.state[n] == "PENDING":
                            self.state[n] = "UPSTREAM_FAILED"
                            stack.append(n)

    def run_one(self, name):
        print(f" -> {name} RUNNING")
        try:
            self.tasks[name].fn()
            self.transition(name, "SUCCESS")
            print(f" -> {name} SUCCESS")
        except Exception as e:
            self.transition(name, "FAILED")
            print(f" -> {name} FAILED: {e}")

    def run(self):
        running = []
        while True:
            running = [t for t in running if t.is_alive()]
            ready = self.ready()
            while ready and len(running) < self.max_parallel:
                name = ready.pop(0)
                # claim the task before the worker thread starts, so the
                # next ready() call cannot hand the same task out twice
                self.transition(name, "RUNNING")
                t = threading.Thread(target=self.run_one, args=(name,))
                t.start(); running.append(t)
            if not running and not self.ready():
                break
            time.sleep(0.05)

# Realistic usage: the revenue dashboard DAG from the figure above
def fake(name, secs, fail=False):
    def _f():
        time.sleep(secs)
        if fail:
            raise RuntimeError(f"{name} hit a SQL error")
    return _f

tasks = [
    Task("extract_orders", fake("ext_o", 0.4)),
    Task("extract_payments", fake("ext_p", 0.5)),
    Task("clean_orders", fake("cln_o", 0.3), parents=["extract_orders"]),
    Task("clean_payments", fake("cln_p", 0.3), parents=["extract_payments"]),
    Task("join_orders_pmt", fake("join", 0.6),
         parents=["extract_orders", "extract_payments"]),
    Task("aggregate_revenue", fake("agg", 0.4),
         parents=["clean_orders", "clean_payments", "join_orders_pmt"]),
    Task("load_dashboard", fake("load", 0.2),
         parents=["aggregate_revenue"]),
]
dag = DAG(tasks, max_parallel=3)
dag.run()
print("final:", dag.state)
```
Sample run on a developer laptop:

```
 -> extract_orders RUNNING
 -> extract_payments RUNNING
 -> extract_orders SUCCESS
 -> clean_orders RUNNING
 -> extract_payments SUCCESS
 -> clean_payments RUNNING
 -> join_orders_pmt RUNNING
 -> clean_orders SUCCESS
 -> clean_payments SUCCESS
 -> join_orders_pmt SUCCESS
 -> aggregate_revenue RUNNING
 -> aggregate_revenue SUCCESS
 -> load_dashboard RUNNING
 -> load_dashboard SUCCESS
```
The Task dataclass: a task is a name, a callable, and a list of parent names. That is it. The callable is the work; the parents list encodes the dependency. In Airflow this struct is called an Operator; in Dagster it is an op or asset; the structural content is the same — name, work, parents.
The ready method: this is the dispatch rule from the figure above, expressed in a few lines of Python. A task is ready if its own state is PENDING and every parent is in state SUCCESS. The all(...) is the formal expression of "wait until everyone is done"; the s == "PENDING" guard prevents a task from being dispatched twice. Why both conditions must be checked, not just the parents: between one ready() call and the next, a task that has already been handed to a worker must not be handed out again. The run loop therefore claims a task — transitions it PENDING → RUNNING — at dispatch time, so exactly one PENDING → RUNNING transition happens per task. This is the same pattern Airflow's scheduler uses, where TaskInstance.state lives in Postgres and the transition is enforced by a row-level lock.
The transition method: state changes are atomic under the lock, and a FAILED propagates downstream by marking every reachable descendant as UPSTREAM_FAILED. The propagation is what stops the executor from running tasks against missing data. Why a separate state UPSTREAM_FAILED rather than just leaving the descendants as PENDING: the on-call view distinguishes "this task did not run because its own logic failed" from "this task did not run because something upstream failed". The two require different fixes — fix the failing task and rerun, versus rerun the whole subgraph after a fix elsewhere. Conflating them is one of the more annoying patterns in homegrown schedulers.
The run loop: the executor's main loop. Find ready tasks, dispatch up to max_parallel, sleep briefly, repeat until no tasks are running and none are ready. The sleep is 50 ms — short enough that the loop feels responsive on small DAGs, long enough that it does not eat CPU. Production schedulers replace this with an event-driven model (a task transition fires a callback that re-runs ready()), but the polling loop is correct and fits in a dozen lines.
The DAG construction: the seven tasks of the revenue dashboard, with their parents. This is the executable form of the figure above. Notice that aggregate_revenue has three parents, two of which (clean_orders, clean_payments) come from sibling cleaning tasks, and one of which (join_orders_pmt) comes from a parallel join branch — the executor handles fan-in transparently, dispatching aggregate_revenue only when all three are SUCCESS.
What this 80-line executor does not do: persist state across crashes (the in-memory state dict is lost on restart), distribute work across multiple worker machines (the threading model is single-process), expose a UI (state is printed to stdout only), or implement retries (failures are terminal). Those are chapters 21 through 24. The mechanism you have at 80 lines is already strictly more powerful than cron at 4,000 lines for the dependency case.
An aside on state propagation timing. Notice that the executor uses a 50 ms polling interval. On a small DAG that finishes in a few seconds, polling is fine — the worst-case dispatch lag is 50 ms, which is invisible against task runtimes measured in seconds. On a production DAG with hundreds of tasks, polling becomes wasteful: the loop wakes up 20 times per second to check states that mostly have not changed. Production schedulers replace polling with an event-driven model: each transition() call fires a callback that wakes a single dispatcher thread, which runs ready() exactly once per state change. The total CPU work is the same; the wakeups drop from "20 × seconds-of-runtime" to "exactly one per task transition". In practice Airflow's scheduler approximates this with a fast scheduling loop over its metadata database; Dagster drives it from its in-memory job state machine. The architectural lesson: a dispatch rule that depends only on state should be triggered by state changes, not by a clock. Chapter 21 walks through the rewrite.
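A minimal sketch of that wakeup mechanism, using a condition variable in place of a metadata database. The class and method names here are illustrative, not any scheduler's API:

```python
import threading

class StateBoard:
    """Sketch: state changes notify a condition variable, so a dispatcher
    can block until something happens instead of polling every 50 ms."""
    def __init__(self, tasks):
        self.state = {t: "PENDING" for t in tasks}
        self.cond = threading.Condition()

    def transition(self, name, new_state):
        with self.cond:
            self.state[name] = new_state
            self.cond.notify_all()  # wake any waiting dispatcher, once per change

    def wait_for_change(self, snapshot, timeout=5.0):
        """Block until state differs from `snapshot`; False only on timeout."""
        with self.cond:
            return self.cond.wait_for(lambda: self.state != snapshot,
                                      timeout=timeout)

board = StateBoard(["a", "b"])
snapshot = dict(board.state)
# Simulate a worker finishing 100 ms from now, on another thread.
threading.Timer(0.1, board.transition, args=("a", "SUCCESS")).start()
print(board.wait_for_change(snapshot))  # -> True: woken by the transition, not a poll
```

The dispatcher's loop becomes "wait for a change, then run ready() once", which is exactly the "triggered by state changes, not by a clock" rule stated above.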
A cycle test extends the executor by ten lines:
```python
def detect_cycle(tasks):
    indeg = {t.name: 0 for t in tasks}
    children = {t.name: [] for t in tasks}
    for t in tasks:
        for p in t.parents:
            indeg[t.name] += 1
            children[p].append(t.name)
    queue = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        n = queue.pop(); seen += 1
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0: queue.append(c)
    if seen != len(tasks):
        bad = [n for n, d in indeg.items() if d > 0]
        raise ValueError(f"cycle involves: {bad}")
```
Add detect_cycle(tasks) to DAG.__init__ before any execution begins. A user who defines Task("A", parents=["B"]) and Task("B", parents=["A"]) will see cycle involves: ['A', 'B'] at construction time, not deadlock at runtime. Kahn's algorithm in iterative form, ten lines, fifty-microsecond runtime on a 1000-task DAG. Why this belongs in __init__ rather than in the run loop: a cycle in the graph is a programming error, not a runtime condition. The right time to surface it is the earliest moment the program can — DAG construction. Deferring the check to runtime means the failure mode is "scheduler hangs forever waiting for a node whose parent is itself", which is the worst possible signal. Failing loud and early is always preferable.
Edges encode contracts, not just ordering
There is a subtler property of the DAG that becomes visible only after you have run it for a few weeks: an edge is more than "B comes after A". An edge is a contract — A promises to produce some specific output (a table, a file, a topic message) in some specific shape, and B reads that output. When the contract holds, the edge holds; when it breaks, the edge does too, regardless of whether the executor's state machine notices.
Consider Aditi's bug at a Bengaluru fintech, which appears twice in this curriculum. Her clean_orders task started writing the order_id column as a string instead of an integer after a routine refactor. The DAG said clean_orders → join_orders_pmt, the executor saw clean_orders succeed (it returned exit code 0), and dispatched join_orders_pmt. The join ran. It produced an empty result — every row failed the integer-to-string match silently. The downstream aggregation produced ₹0 of revenue for the day. Nobody noticed until 11 a.m. when the CFO asked why the dashboard was empty. The DAG's state machine said SUCCESS for every node; the data contract had silently broken at the clean_orders → join_orders_pmt edge.
The lesson is that the DAG correctly handles task failures (a task that throws an exception transitions to FAILED) but not data failures (a task that produces output of the wrong shape still says SUCCESS). Modern schedulers add a layer on top — Airflow's data-aware scheduling (Dataset objects, since 2.4), Dagster's asset checks, dbt's tests — that re-asserts the contract at each edge. The check is small (a SQL assertion, a schema check), but it elevates the edge from "ordering" to "ordering + contract", which is what production teams actually need. Chapter 27 covers this in detail. For now, hold the thought: the DAG you build today encodes ordering; the DAG you run in production also encodes contracts.
Why the contract layer is separate from the dispatch layer rather than merged with it: the dispatch logic ("when to run B") is generic across all DAGs — it only needs to know parent state. The contract logic ("does A's output look right") is specific to each edge — the schema of a payments table is not the schema of an orders table. Mixing the two would force the scheduler to know about every edge's data shape; separating them lets the scheduler stay generic and lets each edge declare its own checks. This is the same separation that gives dbt test and Great Expectations their reason to exist.
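A per-edge check can be as small as a typed assertion on the producing task's output. A hedged sketch against Aditi's bug; the function, the row shape, and the checks are illustrative, not dbt's or Great Expectations' API:

```python
def check_orders_contract(rows):
    """Fail loudly when clean_orders' output breaks the edge's contract."""
    if not rows:
        raise ValueError("contract broken: clean_orders produced no rows")
    for row in rows:
        if not isinstance(row.get("order_id"), int):
            raise TypeError("contract broken: order_id must be int, got "
                            + type(row.get("order_id")).__name__)

good = [{"order_id": 1}, {"order_id": 2}]
bad = [{"order_id": "1"}]  # the string-typed column from the refactor

check_orders_contract(good)  # passes silently; the edge holds
try:
    check_orders_contract(bad)
except TypeError as e:
    print(e)  # -> contract broken: order_id must be int, got str
```

Had a check like this sat on the clean_orders → join_orders_pmt edge, the run would have gone red at 02:00 instead of producing a ₹0 dashboard at 09:00: late and loud beats on-time and silently empty.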
What the DAG abstraction does not give you
A DAG is not a free lunch. Three things the abstraction does not solve, that the next chapters address:
- Cycles need to be detected. A user who writes parents=["B"] on task A and parents=["A"] on task B has a cyclic graph; a DAG executor must reject this at parse time, before dispatch begins. Detection is a topological sort that fails if any node remains undispatched after the sweep — about ten more lines of code, added in chapter 21.
- Dynamic graphs. Some pipelines need the shape of the DAG to depend on runtime data ("for each merchant in the merchant list, run a per-merchant pipeline"). A static DAG cannot express this directly. Airflow added "dynamic task mapping" in 2.3 (2022); Dagster has had DynamicOut since 0.10. Both let a task emit a list at runtime that determines how many copies of a downstream task get dispatched. Chapter 26 covers the mechanism.
- Backfills, partial reruns, and data-aware scheduling. A DAG is a runtime structure for one execution. A pipeline runs daily for years; the production question is "what state is run-of-2026-04-23 in across all DAGs", not "what state is the DAG executor in right now". Chapter 24 (the scheduler UI) and chapter 28 (data-aware scheduling) close that gap by adding a run abstraction on top of the DAG.
The DAG is the right abstraction for the dependency relation between tasks within one run. It is the foundation, not the whole building.
Common confusions
- "A DAG is the same as a workflow." Workflow is the marketing term; DAG is the data structure. A workflow product (Airflow, Dagster, Prefect) is built around a DAG executor, plus a scheduler, plus a metadata database, plus a UI. The DAG is what coordinates task execution within one run; the rest of the product handles repeated runs over time, observability, and operational ergonomics. Calling Airflow "a workflow tool" is correct; calling it "a DAG" is reductive.
- "DAGs cannot have loops, so they cannot express iteration." They cannot have cycles, but they can express bounded iteration through static unrolling — a for batch in 1..N: process(batch) becomes N parallel tasks in the DAG. They cannot express unbounded iteration ("keep retrying until success") within a single run, which is exactly why retry semantics live at the executor level (chapter 23) and not as a graph edge.
- "A DAG forces strict serialisation, killing parallelism." The opposite: the DAG enables parallelism by making concurrent tasks explicit. Two tasks with no path between them are guaranteed safe to run simultaneously; the executor exploits this without the operator having to reason about it. cron forces serialisation by accident — two crontab entries at the same minute lead to undefined ordering and the operator avoids the problem by staggering them.
- "Airflow invented the DAG." The DAG-as-pipeline-abstraction predates Airflow by decades. Make (1976), Tivoli Workload Scheduler (1990s), Pinball at Pinterest, Luigi at Spotify (2012), Azkaban at LinkedIn (2009) — all used DAGs. Airflow (2014, open-sourced from Airbnb in 2015) packaged the DAG with a Python DSL, a scheduler-database split, and a web UI in a way that won the market, but the abstraction itself was old.
- "My pipeline is a sequence of three steps, so a DAG is overkill." Three steps in a chain is A → B → C, which is a DAG. Adopting the abstraction is not over-engineering — it costs nothing because the linear case is the trivial case. The cost shows up only if you adopt a workflow tool (Airflow's overhead is real for tiny pipelines); a 200-line homegrown DAG executor (chapter 21) has near-zero overhead and gives you the abstraction.
- "DAGs are static, so they cannot model streaming." Streaming pipelines model the topology of operators as a DAG (the Flink JobGraph, the Kafka Streams topology, the Beam pipeline DAG) — what is dynamic is the data flowing through it, not the graph itself. Chapters in build 8 cover stream-processing topology in detail.
Going deeper
Topological sort: why the schedule order is the linearisation of the graph
The classical algorithm Kahn (1962) gives any DAG a sequence of nodes such that every edge goes left-to-right. The algorithm: repeatedly find a node with no incoming edges, output it, remove it (and its outgoing edges) from the graph, and continue until the graph is empty. If the algorithm halts before the graph is empty, the graph has a cycle. A scheduler's ready() function is the iterative form of this algorithm — at each step, the set of "ready" tasks is exactly the set of nodes whose incoming edges all point to already-completed nodes. The two formulations are equivalent: Kahn produces a linearisation in advance; a scheduler produces it lazily as tasks complete. The lazy form is what allows parallelism (multiple ready tasks dispatched simultaneously) while preserving the dependency order. The complexity of either form is O(V+E) where V is task count and E is edge count — fast enough that the topological sort is essentially free even on DAGs with thousands of tasks.
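The eager form can be written straight from that description. A sketch, assuming the node-to-parents representation used throughout this chapter, with a final assertion that every edge really does go left-to-right:

```python
from collections import deque

def kahn_order(parents):
    """Kahn (1962): O(V+E) linearisation of a DAG given as node -> parents."""
    indeg = {n: len(ps) for n, ps in parents.items()}
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    q = deque(sorted(n for n, d in indeg.items() if d == 0))  # roots first
    order = []
    while q:
        n = q.popleft()
        order.append(n)
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:  # last incoming edge removed: c is now "ready"
                q.append(c)
    if len(order) != len(parents):
        raise ValueError("graph has a cycle")
    return order

g = {"extract": [], "clean": ["extract"], "join": ["extract"],
     "agg": ["clean", "join"], "load": ["agg"]}
order = kahn_order(g)
pos = {n: i for i, n in enumerate(order)}
assert all(pos[p] < pos[n] for n, ps in g.items() for p in ps)  # edges go left-to-right
print(order)  # -> ['extract', 'clean', 'join', 'agg', 'load']
```

The lazy scheduler does the same indegree bookkeeping implicitly: a SUCCESS transition is the "remove this node's outgoing edges" step, and ready() is the "which nodes now have indegree zero" query.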
Real-system tie-in: Razorpay's payments-attribution DAG at 2 a.m.
Razorpay's data platform runs about 280 daily tasks in Airflow as of 2024, organised into roughly 15 logical DAGs. The largest single DAG is the daily payments-attribution flow that joins Razorpay's order events (from the production Postgres via Debezium CDC) with their payment processor's settlement reports (which arrive via SFTP between 01:00 and 04:00 IST), producing the daily reconciliation report that finance and the GSTN filing pipeline both consume. The DAG has about 60 tasks: 7 source extracts, 22 cleaning and validation tasks, 12 dimensional joins (with merchants, banks, payment methods), 14 aggregations (by merchant tier, city, payment method, RBI report category), and 5 sinks (warehouse, GSTN feed, finance dashboard, internal API, downstream feature store). The DAG runs every day at 02:00 IST with start_date=2018-07-01 and zero gaps in run history — a property cron cannot give. When a settlement file is late on a Saturday morning, the SFTP-sensor task waits up to 4 hours; if it times out, the downstream tasks become UPSTREAM_FAILED and the on-call SRE pages once, not 40 times. The DAG abstraction is what makes that single-page outcome possible.
Why "directed" matters more than people realise
Consider what would change if the dependency relation were undirected — just "A and B are related, do them in some order". The executor could pick either A-then-B or B-then-A. For independent tasks this is fine. For data dependencies it is catastrophic — running a join before its inputs produces an empty join, which silently produces an empty downstream report. The operator would have to remember that "by convention, extract_orders comes before join_orders_pmt" and would inevitably get it wrong on month-end when an ad-hoc rerun has to be done in a hurry. The directed edge is the formal expression of an asymmetric relation. Direction is what stops the scheduler from picking a wrong order; it is not a syntactic decoration on the graph.
Cross-DAG dependencies and the "DAG of DAGs" problem
A single DAG handles dependencies within one logical pipeline. Production teams quickly discover they have cross-DAG dependencies — the marketing-attribution DAG reads tables produced by the orders DAG, which reads tables produced by the CDC DAG. Naively unioning them into one giant DAG is wrong: the three DAGs run on different schedules (CDC is continuous, orders is hourly, attribution is daily), and merging them forces the slowest schedule on everyone. Airflow's solution is ExternalTaskSensor — a sensor task that polls the metadata database for "did task X in DAG Y succeed for run-date D" and proceeds when it has. Dagster's solution is the asset graph — assets are global, so the cross-DAG dependency is just an edge in the global asset graph. Both approaches solve the same underlying issue: a DAG is one execution-time graph, but a system of pipelines is a graph of graphs. The system-level graph is itself a DAG; the question is just whether you express it explicitly (Dagster) or implicitly through sensors (Airflow). Chapter 25 returns to this in depth — for now, recognise that "DAG of DAGs" is itself a DAG, and the same dispatch rule applies one level up.
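Stripped of the Airflow machinery, the sensor pattern reduces to "poll a predicate with a timeout". A sketch; the function name and the fake check are illustrative, not ExternalTaskSensor's actual API (in Airflow the predicate is a query against the metadata database):

```python
import time

def external_task_sensor(check, timeout_s=60.0, poke_interval_s=1.0):
    """Poll a cross-DAG predicate until it holds or the timeout expires.
    `check` stands in for "did task X in DAG Y succeed for run-date D"."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poke_interval_s)
    return False  # caller marks downstream tasks UPSTREAM_FAILED

# Illustrative: the upstream run "completes" on the third poke.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(external_task_sensor(fake_check, timeout_s=5.0, poke_interval_s=0.01))  # -> True
```

The timeout branch is what produces the single-page outcome described in the Razorpay example: one sensor times out, its descendants go UPSTREAM_FAILED, and the on-call gets one page rather than forty.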
How Airflow, Dagster, and Prefect each take the same DAG abstraction in different directions
Airflow's DAG is a Python object whose nodes are Operators. Dagster's DAG is implicit — it is computed from the data dependencies declared between assets, where an asset is a function that produces data and lists the assets it reads. Prefect's DAG is dynamic — the graph is constructed each run by recording which futures depend on which other futures. The three approaches are equivalent in expressive power for static DAGs; they diverge on how dynamic graphs are expressed (Dagster's DynamicOut, Airflow's task mapping, Prefect's native dynamic flow), on lineage (Dagster's asset-first model gives asset-level lineage for free; Airflow needs the OpenLineage plugin), and on developer ergonomics (Dagster's local-first dev loop is fastest; Airflow's UI is most mature; Prefect's deployment model is most cloud-native). All three sit on the same DAG abstraction. Choosing between them is choosing among the layers built on top, not among different abstractions.
Where this leads next
- Writing a DAG executor in 200 lines — chapter 21, the executor that runs the DAG with persistence and crash-recovery
- Task dependencies: wait-for, fan-out, fan-in — chapter 22, the graph patterns that show up over and over
- Retries, timeouts, and poisoned tasks — chapter 23, closing flaw 2 from the previous chapter
- The scheduler UI: timelines, logs, retries — chapter 24, where the run abstraction lives
By chapter 32, the in-memory DAG executor of this chapter has grown into a persistent, distributed, observable scheduler — and the path through chapters 21–32 is exactly the path Airflow, Dagster, and Prefect each took historically. Reading their source after that point becomes a matter of mapping their code to the abstractions you have already built, not of decoding a foreign architecture.
Two notes for readers who plan to adopt one of those tools directly: first, none of them prevents you from understanding the DAG underneath — every one exposes a parents (or equivalent) field on its task definition. Second, the abstraction in this chapter is the smallest you need; the tools layer scheduling, persistence, distribution, and UI on top, but the dispatch rule at the core is the same one-liner from the figure earlier.
References
- Kahn (1962): Topological sorting of large networks — the classical algorithm that every DAG executor uses; two pages, still relevant.
- Airflow 2.x scheduler architecture — the canonical reference for a production-grade DAG executor.
- Dagster: assets and the asset graph — the asset-first reformulation of the DAG.
- Prefect 2: flows, tasks, and dynamic DAGs — the dynamic-DAG approach.
- Luigi: workflows in pure Python — Spotify's 2012 DAG framework; predates Airflow and is still in use.
- Razorpay engineering: data-platform talks 2022–2024 — public talks on the payments-attribution DAG architecture.
- cron: the simplest scheduler and its three flaws — chapter 19, the motivation this chapter resolves.
- Apache Beam: pipeline as DAG — the streaming-graph variant of the same abstraction.
The summary in one sentence: a directed acyclic graph is the data structure that exactly fits "task B depends on task A's output", and adopting it as the dispatch primitive is what closes cron's first structural flaw — wall-clock dispatch in place of dependency dispatch. The DAG itself is roughly 30 lines of Python; an executor that runs it correctly is another 50; the scheduler that runs the executor in production is the next twelve chapters. The reader who has internalised the DAG abstraction can read Airflow's source, Dagster's source, or Prefect's source and recognise each system's design choices as variations on the same core — not as separate brand-named tools to be learned independently. The abstraction is the leverage; the tool that wraps it is a packaging detail.
One last calibration. The DAG is a data structure, not a religion. A team running 3 cron jobs that genuinely have no dependencies between them does not need Airflow; they need to keep the cron file readable. A team running 30 cron jobs with implicit data dependencies needs the DAG abstraction yesterday — even if they implement it themselves in 200 lines of Python rather than adopting a workflow tool. The threshold is not the number of jobs; it is the number of edges between jobs. One job with no edges is fine in cron. Five jobs with twelve edges between them is already a DAG, whether or not anyone has drawn it on a whiteboard. The migration question is "do my jobs have edges?" not "do I have a lot of jobs?" — and once the answer is yes, the abstraction follows.
A practical exercise at this point: take your current pipeline (whatever shape it has — cron, shell scripts, ad-hoc Python) and draw the dependency graph on paper. Nodes are tasks; arrows are "B reads A's output". Look for cycles (almost always a sign that a task is reading and writing the same table — separate it into two tasks). Look for hidden parents (a task that reads a table nobody declared as its parent — the dependency exists in the data even if not in your scheduler). Look for unnecessary serialisation (two tasks that could run in parallel because they have no path between them). The graph you draw is the DAG you should have. Comparing it to what you actually run in cron is the most direct way to see what the migration is about. Most teams who do this exercise discover their cron file under-declares dependencies by 30–50% — the data has more parent-child edges than the schedule implies, and that gap is precisely where the 3 a.m. pages come from. Closing the gap is what adopting a DAG-based scheduler does for you, before you write a single line of Airflow.
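The "unnecessary serialisation" part of the exercise can be automated once the graph is on paper. A sketch, assuming the node-to-parents representation from earlier in the chapter (task names illustrative):

```python
from itertools import combinations

def ancestors(node, parents):
    """All upstream tasks reachable from `node` by following parent edges."""
    seen, todo = set(), list(parents[node])
    while todo:
        cur = todo.pop()
        if cur not in seen:
            seen.add(cur)
            todo.extend(parents[cur])
    return seen

def parallel_pairs(parents):
    """Pairs of tasks with no path between them in either direction:
    safe to run simultaneously, however the cron file serialises them."""
    return sorted((a, b) for a, b in combinations(sorted(parents), 2)
                  if a not in ancestors(b, parents)
                  and b not in ancestors(a, parents))

g = {"extract": [], "clean_orders": ["extract"],
     "clean_payments": ["extract"], "report": ["clean_orders", "clean_payments"]}
print(parallel_pairs(g))  # -> [('clean_orders', 'clean_payments')]
```

Every pair this prints is a place where a staggered crontab is wasting wall-clock time; every pair it does not print is an edge (direct or transitive) the schedule must respect.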