Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

cron: the simplest scheduler and its three flaws

Open crontab -e on any Linux box you have ever touched. The file is six lines of comments, a syntax cheat sheet, and a list of 0 2 * * * /usr/local/bin/extract.sh rows. That is a scheduler. It runs in 22 KB of memory, ships in every Unix distribution since 1975, has zero dependencies, and is arguably the single most-used piece of data-engineering infrastructure on the planet. It is also wrong for almost every modern data pipeline — not because the code is buggy, but because three properties built into its interface make it structurally unable to do the job. This chapter names those three properties.

cron's interface — <wall-clock time> <command> — encodes three structural choices: dispatch by time (not by dependency), no retry semantics on failure, no observability of run state. Every other scheduler (Airflow, Dagster, Prefect, Temporal) is a response to one or more of these three. Naming them precisely is what lets you evaluate any scheduler tool by mapping its features back to the gap it closes.

What cron actually is

cron is a daemon that wakes up once per minute, reads /var/spool/cron/<user> and /etc/cron.d/*, and forks any command whose schedule expression matches the current minute. That is the entire mechanism. The schedule expression is a five-field grammar: minute (0–59), hour (0–23), day-of-month (1–31), month (1–12), day-of-week (0–7). A * in any field means "every value". So 0 2 * * * means "minute 0 of hour 2 of every day" — fire at 02:00 every day.

The daemon has no concept of state across runs. It does not remember whether yesterday's invocation succeeded or failed. It does not know if a previous invocation is still running when the next one fires. It does not know which command produced which output. It opens a fresh shell, sets a minimal environment (PATH=/usr/bin:/bin, SHELL=/bin/sh, no HOME unless the user owns the crontab), forks the command, and forgets about it. If the command writes to stdout or stderr, cron emails the output to the local user via sendmail. On a 2026-era cloud VM where sendmail is not configured, that email goes to /dev/null.

cron is a four-step loop. The fourth step — forget — is where every later scheduler diverges. Airflow, Dagster, and Prefect all replace "forget" with "write run state to a database".

This is a beautiful design for the problem cron was originally built to solve in 1975: nightly backups, log rotation, monthly billing runs. Each of those is a single command, independent of every other command, run on a schedule, where failure means "tomorrow's run will work again". Vixie cron — the implementation that ships with most Linux distributions today — is about 4,000 lines of C and has shipped essentially unchanged since 1992. It is one of the most stable pieces of software in production use anywhere.

Why this matters for understanding the rest of the chapter: cron's three flaws are not bugs in Vixie cron. They are properties of the interface — <time> <command> — that cron exposes. Any scheduler that exposes the same interface inherits the same flaws. systemd timers, Windows Task Scheduler, Kubernetes CronJob, and AWS EventBridge cron(...) rules all expose variants of the same <time> <command> interface and therefore have the same three structural gaps. The flaws are about the shape of the interface, not the quality of the implementation.

Flaw 1: dispatch is by time, not by dependency

The first and most consequential flaw: cron decides when to run a job by consulting a wall clock, not by consulting the state of the data the job depends on. The crontab line 30 2 * * * /etl/attribution.sh says "run at 02:30". It does not say "run after the orders extract has finished". The two are different statements; in any pipeline with dependencies, only the second statement is correct.

The operator translates between these two statements in their head. They know orders typically takes 22 minutes, so they schedule attribution at 02:30 — eight minutes of buffer past the 02:22 expected finish. The schedule is a guess at the dependency. The guess is right most days. It is wrong on the specific days that matter most: month-end (more orders, slower extract), Mega Bargain Days (10× normal volume), the day after a backend deploy that added a slow join (the Aditi war story from chapter 18). On those days the attribution job fires while orders is still running, reads yesterday's snapshot of the orders table, produces a wrong attribution number, and the morning standup gets a wrong dashboard.

The structural property here is that cron's dispatch decision is state-blind. It checks the clock; it does not check whether the upstream data is ready. A real scheduler is state-aware: it dispatches a task when the upstream tasks have completed (and optionally when their output passes a quality check). The difference is not 30 minutes of buffer; it is whether the dispatch decision can ever be wrong.

The same 41-minute orders run produces wrong data under cron and correct data under a DAG scheduler. The difference is whether the dispatch decision is keyed on a wall clock or on an upstream completion event.

There is a tempting workaround: make attribution.sh poll for a sentinel file written by orders.sh on completion. That works for one dependency. It does not work for fan-in — a job with two parents needs to poll for two sentinels and decide what to do if one is present and the other is not. It does not work for backfills — running yesterday's attribution after fixing a bug requires manually deleting and recreating sentinels for the right historical date. It does not work for partial dependencies — "run only if rows exist in the orders table" is not encodable as "sentinel file exists". Sentinel-file polling is the bash-shaped approximation of dependency-based dispatch, and the approximation breaks at exactly the cases where dependency-based dispatch is most valuable.

Why "state-aware vs state-blind" is the right framing rather than "smart vs dumb scheduler": cron is not unintelligent — it is intelligent at a different problem (clock-based dispatch). The problem is interface-shape. Any scheduler whose interface is <time, command> cannot express dependency. Any scheduler whose interface is <dependency, command> can express clock-based dispatch by treating the clock as a virtual upstream. The dependency interface is strictly more expressive; the clock interface is strictly less expressive; the latter is a special case of the former. cron's flaw is taking the special case as the primary interface.

Flaw 2: there is no retry semantics

The second flaw: when cron forks a command and the command exits non-zero, cron does nothing. It does not record the failure. It does not retry the command. It does not alert anyone, beyond the email-to-/dev/null mechanism described above. From cron's perspective, the command ran and the world moved on; the next 02:00 fire will run a fresh invocation, and that fresh invocation will see whatever state the failed run left behind.

The reader's first instinct is to put a retry loop inside the command. That works for transient TCP errors:

for attempt in 1 2 3; do
  curl --fail https://api.razorpay.com/orders > /tmp/orders.json && break
  sleep 60
done

Three retries, 60-second sleep, succeed-and-break. This is the bash-shaped approximation of retry semantics, and it works for one specific failure mode: transient network errors that resolve in under three minutes. It is wrong for every other failure mode.

A 429 rate-limit response should back off exponentially with jitter, not retry every 60 seconds (the second retry is more likely to be rate-limited than the first). A 401 auth error should fail fast — retrying with the same expired token will not produce a different result. A 500 server error should retry but cap retries based on time, not count — retrying a 500 for 30 minutes is fine; retrying it for six hours blocks downstream tasks. An OOM kill is not catchable by the bash loop at all (the process is dead before the || fires). A segfault, ditto. A network-partition that takes the upstream offline for 20 minutes will exhaust the three retries in three minutes and then fail permanently — when a wait-and-retry would have succeeded.

Retry strategies that are correct in production use the failure mode to pick the strategy. The taxonomy is roughly:

Failure mode	Right retry strategy
Transient TCP / DNS	3 retries, 1s/4s/16s exponential backoff with jitter
HTTP 429 rate-limited	Honour `Retry-After` header; otherwise exponential with jitter capped at 60s
HTTP 5xx	Retry with cap on total time, not count; 15-minute budget
HTTP 4xx (auth, validation)	No retry — fail fast
Disk full	No retry — fail fast, alert SRE
OOM kill	Retry once with reduced batch size; otherwise fail
Network partition	Long retry window (1 hour+) with backoff; eventual fail
Upstream still warming up	Sensor pattern: poll source, dispatch when ready

cron has no concept of any of this. It has no concept of retry at all. The retry logic has to be embedded in the command. Embedding it in the command means six different jobs end up with six slightly different retry implementations, none tested, all maintained by whoever wrote them last. The structural property is that retry semantics belong in the scheduler, not in the job, because retries depend on cross-cutting concerns (max parallel retries, total retry budget across the DAG, dead-letter queue routing) that the job cannot see.

A real scheduler exposes retries declaratively:

@task(retries=3, retry_delay=timedelta(minutes=5),
      retry_exponential_backoff=True, max_retry_delay=timedelta(hours=1))
def extract_orders():
    ...

The scheduler implements the retry loop, the backoff, the jitter, the dead-letter routing, and the alerting. The task author writes the task. The cross-cutting concerns are written once, tested once, and applied uniformly to every task in the system.

Why retry-in-the-job is structurally worse than retry-in-the-scheduler even if the code happens to be correct: a retry that lives inside a task cannot communicate with retries elsewhere in the DAG. If task A is retrying because of a 429 from a shared upstream, and task B (in the same DAG) is also retrying because of a 429 from the same upstream, the two retry loops compound — they both keep hammering the upstream, making the rate-limit worse. A scheduler-level retry can apply a global rate-limit policy (no more than N retries against this upstream per minute) and let the rest of the DAG wait. The architecture-level fix requires the retry to be observable outside the task, which means it has to live in the scheduler.

Flaw 3: there is no observability of state

The third flaw: cron does not store run state. There is no record of "this job ran on this day at this time, took N seconds, exited with code C, produced these log lines". The on-call engineer who needs to answer "did orders.sh run last Tuesday at 02:00?" has to reconstruct the answer from side effects: was there a row in the warehouse with created_at = 2026-04-21 02:00? Did /var/log/etl/orders.<timestamp>.log exist? Did the email-to-the-user-account come through? Each of these is an indirect signal. None of them are "the scheduler told me".

The engineering-effort cost of this flaw is the smallest of the three; the human-effort cost is the largest. Reconstructing pipeline state under time pressure is the part of being on-call for a cron-based stack that ages people. The on-call engineer at 3:14 a.m., woken from REM sleep, has to:

ssh to the right VM (was it etl-prod-1 or etl-prod-2 after the migration last quarter?)
crontab -l to see what was supposed to run, when
ls -la /var/log/etl/ to see what actually ran, sorted by timestamp
grep ERROR /var/log/etl/orders.20260421-020023.log to find the failure
cross-reference with the destination warehouse to see what data did and didn't land
mentally translate between "wall-clock 02:00" and "logical execution date 2026-04-21"
decide whether to manually rerun, restore from snapshot, or wait for tomorrow

A real scheduler shows all of this in a UI: every DAG run is a row with start time, end time, status, duration, log link, lineage, and a one-click rerun button. The same task that took 25 minutes at 4 a.m. on a cron stack takes 90 seconds on an Airflow stack. The difference is not Airflow being magical; it is that the run state was recorded rather than reconstructed.

The left panel is the cron on-call experience. The right panel is the scheduler-UI on-call experience. The 25× speed difference is observability, not raw performance.

The deeper consequence of no observability is that the system has no concept of "logical run". cron fires orders.sh at 02:00 every day; whether the 02:00 invocation today is a different logical entity from yesterday's is not represented anywhere in cron. Airflow calls this the execution_date (or logical_date in Airflow 2.2+); Dagster calls it the partition_key; Prefect calls it the flow_run_id. The concept's name varies; the concept itself is the same — a stable identifier for "today's run" that survives reruns, backfills, and partial failures. cron has no such concept. Reruns are a manual exercise; backfills are a multi-hour scripting job; partial failures leave the operator no way to ask "what was the state at 02:14 yesterday?".

Building a tiny scheduler that closes one of the three flaws

To feel the gap between cron and a real scheduler, build the smallest possible scheduler that closes one of the flaws — flaw 3, observability. The following Python file is 64 lines, runnable, and stores every run as a row in SQLite.

# tiny_scheduler.py — runs jobs on a schedule, records every run.
import sqlite3, subprocess, time, sys, os, json
from datetime import datetime, timezone
from dataclasses import dataclass

DB = "scheduler.db"

@dataclass
class Job:
    name: str
    command: list      # e.g. ["python", "extract_orders.py"]
    interval_seconds: int  # e.g. 86400 for daily

JOBS = [
    Job("orders_extract", ["python", "extract_orders.py"], 86400),
    Job("attribution",    ["python", "attribution.py"],    86400),
]

def init_db():
    con = sqlite3.connect(DB)
    con.execute("""create table if not exists runs (
        id integer primary key autoincrement,
        job text not null, started_at text not null, ended_at text,
        exit_code integer, duration_s real, stdout text, stderr text)""")
    con.commit(); con.close()

def last_success_at(job_name):
    con = sqlite3.connect(DB)
    row = con.execute("""select started_at from runs
        where job=? and exit_code=0 order by id desc limit 1""",
        (job_name,)).fetchone()
    con.close()
    return datetime.fromisoformat(row[0]) if row else None

def run_job(job):
    started = datetime.now(timezone.utc)
    print(f"[{started.isoformat()}] {job.name} starting...")
    proc = subprocess.run(job.command, capture_output=True, text=True)
    ended = datetime.now(timezone.utc)
    duration = (ended - started).total_seconds()
    con = sqlite3.connect(DB)
    con.execute("""insert into runs(job,started_at,ended_at,exit_code,
        duration_s,stdout,stderr) values(?,?,?,?,?,?,?)""",
        (job.name, started.isoformat(), ended.isoformat(),
         proc.returncode, duration, proc.stdout[-2000:], proc.stderr[-2000:]))
    con.commit(); con.close()
    print(f"[{ended.isoformat()}] {job.name} exit={proc.returncode} ({duration:.1f}s)")

def tick():
    now = datetime.now(timezone.utc)
    for job in JOBS:
        last = last_success_at(job.name)
        if last is None or (now - last).total_seconds() >= job.interval_seconds:
            run_job(job)

if __name__ == "__main__":
    init_db()
    while True:
        tick()
        time.sleep(60)

A sample run, after pointing the two job commands at small Python scripts and letting it cycle for two days:

$ sqlite3 scheduler.db "select job, started_at, exit_code, duration_s from runs"
orders_extract|2026-04-23T02:00:03+00:00|0|41.2
attribution   |2026-04-23T02:00:44+00:00|0|18.7
orders_extract|2026-04-24T02:00:01+00:00|0|22.5
attribution   |2026-04-24T02:00:23+00:00|2|0.4

Lines 17–22: the JOBS list is the equivalent of crontab — but expressed as Python data instead of a five-field grammar. The data form makes it trivial to add fields like depends_on=["orders_extract"] or retries=3 later; the cron grammar would need a parser change for either.

Lines 24–30 (the init_db function): persistence is the hinge. cron's forget step is replaced with insert into runs. Every run is a row; nothing is lost. The on-call engineer's grep exercise becomes a select * from runs where job='orders_extract' and exit_code != 0.

Lines 32–37 (last_success_at): the scheduler now has a notion of "did this job already run today". The interval_seconds check on line 50 uses this to decide dispatch — meaning this scheduler is closer to dependency-based than cron even though it doesn't yet have task-to-task dependencies. The dispatch keys on observed state, not on a wall clock.

Lines 39–48 (run_job): the run records exit_code, duration_s, and the last 2 KB of stdout/stderr. Two kilobytes is enough for the error message in 95% of cases and avoids unbounded log growth in the database. The full logs go to a file as in cron; the database stores the tail for fast lookup.

Lines 50–55 (tick): the dispatch loop. For each job, ask "is the last successful run older than the interval?" and run if yes. This is the smallest change from cron's interface that closes flaw 3. The change is six lines.

What this 64-line scheduler does not do: handle dependencies between jobs (flaw 1 still wide open), retry on failure (flaw 2 still wide open), expose a UI (the only way to query state is sqlite3), survive its own crash (the loop dies, the scheduler stops). Those are the next four chapters. But even at 64 lines, this scheduler has the property cron lacks: an engineer paged at 3 a.m. can answer "what happened?" in 10 seconds with a single SQL query, instead of 25 minutes with grep.

Why building the tiny scheduler before introducing Airflow matters pedagogically: the reader who copies an Airflow DAG without first having felt the gap between cron and a basic state-tracking scheduler treats Airflow's database, executor, and UI as one undifferentiated brand. The reader who has written 64 lines of Python that just track run state can identify which Airflow features close which flaws — and is therefore equipped to evaluate a different scheduler (Dagster, Prefect, Temporal) by mapping its features to the same three flaws. The vocabulary of mechanisms transfers; the vocabulary of brand names does not.

What the next chapters extend

The 64-line scheduler closes flaw 3 (observability). The remaining flaws need different machinery.

Flaw 1 (dependency dispatch) is closed by the DAG abstraction — the next chapter, The DAG as the right abstraction, introduces directed acyclic graphs as the data structure that replaces both the crontab and the per-job interval. The dispatch decision becomes "the upstream tasks of this task have all completed" instead of "the wall clock matches my interval".

Flaw 2 (retry semantics) is closed by per-task retry policies — chapter 23, Retries, timeouts, and poisoned tasks, formalises the failure-mode-to-strategy table from this chapter into a declarative retry_policy field on the task, with the scheduler implementing the loop and the backoff.

Flaw 3 (observability) is closed by the scheduler-state database that this chapter's tiny scheduler already has, plus a UI on top — chapter 24, The scheduler UI: timelines, logs, retries, explains why the UI is not a finishing flourish but the on-call workflow's primary surface.

By chapter 30, the scheduler the reader has built (in pieces, across 12 chapters) does what Airflow's scheduler does. By chapter 32, the chapter pivots to "now compare with Airflow / Dagster / Prefect" — and the reader who has built each piece reads the production tools' source with comprehension, not awe.

Common confusions

"systemd timers are a better cron and don't have these flaws." systemd timers fix the cron-fork-style implementation (logs go to journald, units have status, dependencies between units are expressible via Requires=/After=). They do not fix the structural flaws — Requires= is start-order, not data-dependency; failure semantics are still per-unit and don't compose into a DAG-level retry policy; observability is per-unit, not per-DAG-run. systemd is a strictly better cron; it is not a scheduler.
"Kubernetes CronJob is a real scheduler." CronJob adds container packaging and concurrency policies on top of cron's interface, but the dispatch decision is still wall-clock-based. It exposes successfulJobsHistoryLimit and failedJobsHistoryLimit, which is a whisper of observability, but the data model is still "fire commands on a schedule", not "execute a DAG". Most teams that adopt CronJob end up running Airflow on top of Kubernetes anyway.
"AWS EventBridge cron(...) rules are managed cron and that's enough." EventBridge fires events on a schedule; the events trigger Lambda or Step Functions or whatever target. For the simple case (one Lambda, fired daily), it works. The dependency-graph case requires Step Functions on top, which means the team has now adopted two schedulers — EventBridge for trigger and Step Functions for orchestration. That is a perfectly reasonable architecture but it is not "cron is enough"; it is "we built a scheduler stack out of two AWS services".
"My cron jobs are independent, so the dependency flaw doesn't apply to me." Independent today, dependent in six months. The most common shape of pipeline drift is: two jobs added at different times by different engineers turn out to read the same upstream table; one of them now silently depends on the other's freshness. The dependency exists in the data even if it isn't expressed in the schedule. Auditing for hidden data dependencies is harder than auditing for declared ones.
"The retry flaw is solved by a for loop in bash." A for loop covers the retriable-transient-error case. It does not cover failure-mode-aware retry, total retry budget across the DAG, dead-letter routing, jitter to avoid thundering herd on a shared upstream, or alerting after the final failure. Each of these is another 10–30 lines of bash that has to be kept in sync across every job.
"The observability flaw is solved by writing logs to S3 and grepping there." S3 + grep solves "where did the log go". It does not solve "what was the run state at 02:14 yesterday across all 12 DAGs", "is this DAG currently running or stuck", or "rerun yesterday's failed task with one click". Logs are the artefact; run state is the index. Dumping artefacts to S3 without an index is a logging system, not an observability system.

Going deeper

The history: Vixie cron, fcron, anacron, and why the three flaws survived 50 years

Bob Vixie's cron (1987) is the implementation that ships with most Linux distributions today. It was a rewrite of the original Brian Kernighan cron from 1975 that added /etc/cron.d directories, environment-variable handling, and the day-of-week/day-of-month "or" semantics. Subsequent reimplementations (fcron in 1999, dcron in 2002) added features — handling missed runs after a hibernation, finer-grained timing — but kept the same <time, command> interface. anacron (1998) added "run jobs that should have run while the machine was off" semantics for laptops, but again the interface is <period, command>, not dependency-based. The interface has survived because for the original use case (system-administration jobs on one machine) it is exactly right; the flaws emerge only when the interface is re-purposed for data pipelines that have inter-job dependencies, retry requirements, and observability needs that single-machine sysadmin work does not.

The "`crontab(5)` grammar" and what it can and cannot say

The five-field grammar — minute, hour, dom, month, dow — has roughly 1.4 trillion expressible schedules (60 × 24 × 31 × 12 × 7 with the , and */N extensions). What it cannot express: "the third Tuesday of the month" (no nth-weekday operator), "the last business day of the quarter" (no calendar awareness), "after job X completes" (no inter-job reference), "any time after 02:00 on the day after a holiday" (no calendar joins), "every 90 minutes starting from 03:00 on Monday" (no anchor-date semantics). Most of these are common business requirements. The dbt-utils package has a last_business_day macro precisely because cron cannot express it. Schedulers that target business workflows (Dagster's scheduling, Airflow's timetable) replace the grammar with code precisely so that arbitrary calendar logic is expressible.

The "missed run" problem and how cron almost-but-not-quite handles it

If a server is down at 02:00 (reboot, hardware failure, kernel upgrade), cron does not run the 02:00 jobs when the server comes back up at 02:15 — it skips them, and the next 02:00 fire is tomorrow. anacron exists specifically to fix this for laptops: if a job tagged daily did not run today, anacron runs it when the machine boots. But anacron is opt-in and does not handle the "missed run during a regional cloud outage" case for production VMs. A real scheduler treats every scheduled run as a logical entity that must eventually complete, and "the scheduler was down at 02:00 yesterday" is a recoverable condition — the run is logged as missed, and on next startup the scheduler decides whether to backfill it. This is a non-trivial property; it is the scheduler-side equivalent of database write-ahead logging.

Indian-context calibration: why the wall arrives at 3 jobs, not 5, in fintech and quick-commerce

The "5 jobs is the cliff" heuristic from chapter 18 is a global median. In Indian fintech (PaisaBridge, DigiPaisa, KreditClub) and quick-commerce (MinuteMart, DashKart, Instamart), the cliff arrives earlier — typically at 3 jobs — because of regulatory dependency chains. The orders-extract job feeds the GST-reconciliation job which feeds the daily MIS report which feeds the management dashboard at 09:00 IST. A failure in any link causes a regulatory submission delay that is visible to the Income Tax department or RBI, not just to the morning standup. The visibility-of-failure asymmetry compresses the wall: a Bengaluru fintech with three cron jobs and one regulatory deadline crosses the wall at exactly the moment the third job is added, even though the engineering complexity is below the global threshold. Teams that map their pipeline against compliance deadlines (GST filings on the 11th and 20th of every month, TDS quarterly returns, RBI daily NEFT reconciliation, SEBI market data submissions) tend to migrate to a real scheduler within the first quarter of building the third pipeline. Teams that don't — typically B2B SaaS without filing deadlines — tolerate the cron stack for years longer.

Why cron is still the right answer for some things forever

A practical caveat: this chapter is not an argument that cron should be replaced everywhere. cron is still the right answer for system-administration tasks on a single machine — log rotation, backup-to-S3, certificate renewal cron, apt-get update for unattended upgrades. The interface fits the problem: each task is independent, failures are tolerable (tomorrow's run will work), observability needs are minimal (the log is enough). Many production systems run a real scheduler for the data pipeline and cron for the maintenance tasks. The mistake to avoid is using cron for the data pipeline because "we already have cron for the maintenance tasks". The two are different scales of problem and deserve different tools. PaisaBridge's data-platform team in 2024 documented exactly this split: Airflow for the 200-task data DAG, cron for the eight maintenance tasks (log rotation on the Airflow cluster itself, certificate renewal, the backup of the metadata database). Each tool fits the problem it was designed for.

Where this leads next

The DAG as the right abstraction — chapter 20, the data structure that closes flaw 1
Writing a DAG executor in 200 lines — chapter 21, the executor that runs the DAG
Task dependencies: wait-for, fan-out, fan-in — chapter 22, the graph patterns
Retries, timeouts, and poisoned tasks — chapter 23, closing flaw 2
The scheduler UI: timelines, logs, retries — chapter 24, closing flaw 3 fully

Build 4 closes the three flaws one by one. By chapter 32, the reader has a working scheduler in 200 lines of Python plus a UI, and the next chapters compare it to Airflow / Dagster / Prefect by mapping each tool's features back to which of the three flaws it addresses. The mental model — flaw → mechanism → tool — is what makes scheduler choice tractable instead of vibes-driven.

References

Vixie cron source code — the canonical implementation; reading the 4,000 lines of C is the fastest way to understand cron's actual semantics.
crontab(5) and cron(8) man pages — the grammar and daemon reference.
Airflow scheduler architecture — the canonical reference for how a real scheduler decomposes the FSM over the DAG.
Dagster scheduling concepts — the asset-first alternative framing.
Prefect 2 scheduling docs — the third-party-comparison reference.
PaisaBridge engineering: data platform 2018–2024 — public-facing post-mortems on the cron-to-Airflow migration.
Wall: hand-rolled scheduling breaks past five jobs — chapter 18, the motivation this chapter formalises.
Locally Optimistic: when to graduate from cron — the canonical industry post on the migration trigger.

The summary in one sentence: cron's interface is <wall-clock time, command>, and that interface forces three structural choices — time-based dispatch, no retry, no observability — which together are the seed of every modern scheduler's feature set. Naming the three precisely is what lets you read Airflow's source, or Dagster's, or Prefect's, and recognise each feature as a closed flaw rather than a brand-named novelty. The 64-line scheduler in this chapter closes one of the three; chapters 20 through 24 close the other two and add the DAG abstraction on top. By chapter 32, you have built a scheduler that does what Airflow's scheduler does — and you can decide for your team's specific load profile whether to ship the homegrown thing, adopt Airflow, adopt Dagster, or adopt something newer, based on which gaps each tool closes for your shape of problem.

A practical exercise at this point: take your current cron file, the wrapper script it calls, and a list of your 3 a.m. pages from the last quarter. Map each page to one of the three flaws — was the page caused by time-based dispatch (a job ran against stale upstream), by missing retry semantics (a transient error was not retried), or by missing observability (the diagnosis took 30 minutes longer than it needed to)? In most teams the distribution is roughly 50% flaw 1, 25% flaw 2, 25% flaw 3 — which is also the rough order of importance and the order in which the next chapters address them. The mapping exercise turns the abstract "cron has flaws" into the concrete "here are the four pages from last quarter that flaw 1 caused", and that concreteness is what makes the migration to a real scheduler a project the team can scope and ship instead of a vague aspiration.

cron: the simplest scheduler and its three flaws

What cron actually is

Flaw 1: dispatch is by time, not by dependency

Flaw 2: there is no retry semantics

Flaw 3: there is no observability of state

Building a tiny scheduler that closes one of the three flaws

What the next chapters extend

Common confusions

Going deeper

The history: Vixie cron, fcron, anacron, and why the three flaws survived 50 years

The "crontab(5) grammar" and what it can and cannot say

The "missed run" problem and how cron almost-but-not-quite handles it

Indian-context calibration: why the wall arrives at 3 jobs, not 5, in fintech and quick-commerce

Why cron is still the right answer for some things forever

Where this leads next

References

The "`crontab(5)` grammar" and what it can and cannot say