Wall: hand-rolled scheduling breaks past five jobs
Aditi runs the analytics stack at a Bengaluru D2C company that sells skincare on Flipkart, Nykaa, and its own site. Eleven months ago she had two cron entries: one to pull yesterday's orders at 02:00, one to refresh the dashboard table at 02:30. It worked. Today there are forty-seven entries in /etc/cron.d/etl, three of which have # TODO: figure out why this fails on Mondays next to them, two of which run the same script with different flags, and one of which she is mildly afraid to delete because nobody remembers what it does. Last Tuesday at 3:14 a.m. the on-call phone rang because the inventory pipeline had been writing to /tmp/inventory.csv while the Tally GST export was reading from the same file — the cron schedules had drifted by 90 seconds over the last quarter and the two jobs now overlapped. The dashboard for the morning standup was wrong. Aditi spent the next two hours writing a 60-line bash script that adds flock calls to every cron entry. It is the third such script she has written this year.
A handful of cron entries works. Past about five interdependent jobs, the gaps between cron and a real scheduler become incidents: jobs cannot wait for each other, retries are bash boilerplate, failures are silent, schedules drift, and the operator has no map of what runs when. This chapter is the wall — the point where adding one more cron entry costs more than building a DAG executor would. The next build (cron → DAG → scheduler) is the response.
Why five jobs is the cliff
Cron is the simplest possible scheduler. It runs a command at a wall-clock time. That is the entire interface — and for one job, or two, or even five independent jobs that don't talk to each other, that interface is the right answer. The complexity that cron does not handle is invisible until the jobs start needing each other.
The progression most teams trace looks like this. Start with one job: pull yesterday's orders, refresh the dashboard. One cron entry. Six months later, marketing wants attribution on top of orders, so a second job reads the orders table and joins it with the ad-spend export. Now there are two entries — and the second has to run after the first. The naive fix is to schedule the second 30 minutes later. It works for a year. Eventually the orders pull starts taking 35 minutes (more orders, slower extract), the attribution job runs against yesterday's orders table because today's hasn't landed yet, and the dashboard for the standup is wrong. Aditi adds # wait at least 5 min between to the cron file as a comment.
Then a third job: forecast next week's demand from the last 90 days of orders. It depends on the orders pull but not on the attribution job. A fourth: GST reconciliation, which reads the orders table and the warehouse's inventory snapshot. A fifth: refund processing, which reads orders and triggers a webhook to the customer-care team. By the time the team has five interdependent jobs, the dependency graph in the team's head looks nothing like the cron file. The cron file says "run at 02:00, 02:30, 02:45, 03:00, 03:15"; the actual dependency graph is a fan-out from the orders pull to four downstream jobs, of which two have additional inputs.
Why five is the rough cliff and not three or ten: with three jobs the operator can hold the dependency graph in their head; with ten the operator has already given up and started writing wrappers. Five is the awkward middle where the operator still believes cron is fine but the schedule has begun lying about reality. The cliff is not a clean number — some teams hit it at four jobs (highly variable runtimes, tight overlapping windows), some at twelve (independent jobs, long runtime budgets, generous gaps). The shared property is the moment when the operator first writes "wait until $other_job_done" in a bash wrapper. That moment is the wall.
The five things cron does not do
Cron does one thing: it runs a command at a wall-clock time. Everything a real pipeline needs around that one thing has to be hand-built. The five gaps below — plus a sixth that follows from the first — are the reasons a hand-rolled scheduler stops working past about five jobs. Each gap individually is fixable with a bash wrapper; the wrappers compound.
Gap 1: there is no concept of "wait for another job". Cron has no notion that job B should run after job A. The operator encodes the dependency by scheduling B 30 minutes later than A. When A starts taking 35 minutes, B runs against yesterday's data. The bash fix is to write a wrapper that polls for a sentinel file (/var/run/etl/orders.done) before starting, and have job A touch the sentinel on success. The wrapper is 20 lines per dependency, and every team writes some version of it.
Gap 2: failures are silent. Cron emails the output of the command to the user account that owns the crontab. On a server where the local mail spool is not configured (most cloud VMs in 2026), the email goes nowhere. The cron job that errored at 02:14 leaves no trace anywhere a human would look — no Slack message, no PagerDuty alert, no row in a database. The bash fix is to wrap every command in (command || curl -X POST $SLACK_WEBHOOK -d "job failed") and hope the curl works. Some failures (segfault, OOM kill) bypass the trap entirely.
Gap 3: retries are not built in. A transient network blip during the orders extract leaves the job failed; cron does not retry. The bash fix is for i in 1 2 3; do command && break; sleep 60; done, which works for transient failures but blocks the cron slot — if the retry takes 20 minutes the next scheduled run starts overlapping with this one. Retries that adapt to the failure mode (back off on rate-limit, fail fast on auth error) require parsing exit codes, which means a 50-line wrapper instead of a 5-line one.
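The difference between the 5-line and the 50-line wrapper is easier to see outside bash. Below is a minimal sketch in Python — the exit-code convention and the backoff table are invented for illustration, not any standard — of what "adapt the retry to the failure mode" actually means:

```python
import time

# Hypothetical exit-code convention (illustrative, not a cron or POSIX standard):
# 0 = success, 75 = transient failure, 92 = rate-limited by the upstream API.
# Anything else (e.g. an auth error) is not worth retrying at all.
RETRYABLE = {75: 60, 92: 300}   # exit code -> base backoff in seconds

def run_with_retries(job, max_attempts=3, sleep=time.sleep):
    """job: a callable returning an exit code. Back off per failure mode;
    fail fast on non-retryable codes instead of hammering a dead endpoint."""
    for attempt in range(1, max_attempts + 1):
        code = job()
        if code == 0:
            return attempt                      # which attempt succeeded
        if code not in RETRYABLE:
            raise RuntimeError(f"non-retryable exit code {code}")
        if attempt < max_attempts:
            sleep(RETRYABLE[code] * attempt)    # longer waits for rate limits
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The same logic in bash means parsing `$?` in nested conditionals per job — which is exactly how the 5-line loop becomes the 50-line wrapper.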
Gap 4: schedules drift in real terms. A cron entry that says 0 2 * * * fires at some point during the 02:00 minute — cron guarantees it tries the job, not that the job starts at 02:00:00.000, and on a loaded host the actual start can slip by seconds. Daylight-saving boundaries (which India does not observe, but the cloud VM may honour if its timezone was misconfigured) cause double-runs or skipped runs. The deeper problem is that the gaps are fixed while the runtimes are not: the cron file with 0 2, 30 2, 0 3 looks like jobs run 30 minutes apart, but the interval that matters — the end of one job to the start of the next — shrinks every time an upstream job gets slower. Over time, "30 minutes of headroom" becomes "8 minutes sometimes", and the second job starts before the first finishes.
Gap 5: there is no map. When the on-call engineer is paged at 3 a.m. with "the dashboard is wrong", they have to open /etc/cron.d/etl in vim, read 47 entries, recall which entry produces which output, recall which entries depend on which, and figure out which one failed. There is no airflow dags list-runs equivalent. The state of the system is implicit in side effects on the filesystem and the destination warehouse. Reconstructing it under time pressure is the part that ages on-call engineers prematurely.
Gap 6 (a corollary): there is no concept of "today's run vs yesterday's run". Cron fires at a wall-clock time and the command runs; whether this invocation is logically distinct from yesterday's is a property the operator has to track manually, usually by parameterising the date into output filenames or table partition keys. Nothing in cron knows that orders.sh at 02:00 today produces a different logical entity from orders.sh at 02:00 yesterday. A real scheduler tracks runs as first-class entities — Airflow's execution_date, Dagster's partition, Prefect's flow_run_id — so that "rerun yesterday's failed orders extract" is a one-line UI action. In cron it is a 20-minute exercise of finding the right environment variables and rerunning the script manually.
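To make the run-as-entity idea concrete, here is a minimal sketch — the schema and function names are invented, not Airflow's or Dagster's API — of what tracking each invocation as a row keyed by (job, logical date) buys: "find yesterday's failed extract" becomes a query plus a re-dispatch, not filesystem archaeology.

```python
import sqlite3
from datetime import date

# A sketch of runs-as-first-class-entities: each invocation is a row keyed by
# (job, logical_date), so two runs of the same script on different days are
# distinct, queryable things. Schema is illustrative.
def open_state(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS runs
                  (job TEXT, logical_date TEXT, state TEXT,
                   PRIMARY KEY (job, logical_date))""")
    return db

def record(db, job, logical_date, state):
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
               (job, logical_date.isoformat(), state))

def failed_runs(db, job):
    rows = db.execute("SELECT logical_date FROM runs "
                      "WHERE job = ? AND state = 'failed'", (job,))
    return [d for (d,) in rows]

db = open_state()
record(db, "orders", date(2026, 2, 2), "success")
record(db, "orders", date(2026, 2, 3), "failed")   # a distinct logical run
print(failed_runs(db, "orders"))                   # -> ['2026-02-03']
```

Re-running a failed day is then an UPDATE plus a dispatch — the one-line action the real schedulers expose through a UI.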
What the eleventh cron entry actually looks like
The cleanest way to feel the wall is to read the wrapper that a team typically writes when they cross it. Below is the Aditi-shaped version — a single shell script that wraps every cron entry. It is real, it works, and it is what teams have to write because cron itself does none of it. Then walk through what is missing.
```bash
#!/bin/bash
# /etl/run.sh — the wrapper every cron entry calls.
# Usage: run.sh <job_name> <command...>
set -uo pipefail

JOB="$1"; shift
LOCKDIR="/var/run/etl"
LOG="/var/log/etl/${JOB}.$(date +%Y%m%d-%H%M%S).log"
SENTINEL="${LOCKDIR}/${JOB}.done"
SLACK="https://hooks.slack.com/services/T0/B0/XXX"
mkdir -p "$LOCKDIR" "$(dirname "$LOG")"

# Lock — prevent overlap with a previous run still in progress.
exec 9>"${LOCKDIR}/${JOB}.lock"
if ! flock -n 9; then
  curl -s -X POST "$SLACK" -d "{\"text\":\"[$JOB] previous run still active, skipping\"}"
  exit 0
fi

# Wait for upstream sentinels (encoded per-job, hardcoded here).
case "$JOB" in
  attribution|forecast|gst|refunds)
    for i in $(seq 1 30); do
      [[ -f "${LOCKDIR}/orders.done" ]] && break
      sleep 60
    done
    if [[ ! -f "${LOCKDIR}/orders.done" ]]; then
      curl -s -X POST "$SLACK" -d "{\"text\":\"[$JOB] orders never landed, aborting\"}"
      exit 2
    fi ;;
esac

# Run with retries, capturing logs.
START=$(date +%s)
for ATTEMPT in 1 2 3; do
  echo "=== attempt $ATTEMPT at $(date -Iseconds) ===" >> "$LOG"
  if "$@" >> "$LOG" 2>&1; then
    touch "$SENTINEL"
    DURATION=$(( $(date +%s) - START ))
    curl -s -X POST "$SLACK" -d "{\"text\":\"[$JOB] ok in ${DURATION}s\"}"
    exit 0
  fi
  sleep $((ATTEMPT * 60))
done
curl -s -X POST "$SLACK" -d "{\"text\":\"[$JOB] FAILED after 3 attempts — see $LOG\"}"
exit 1
```

Output when refunds fires on Tuesday at 03:15:

```
[refunds] previous run still active, skipping
```

(A previous invocation of refunds still holds the lock. Note what the message does not say: the lock is per-job, so nothing stops refunds and orders.sh from racing on the files they share.)
The wrapper is about 40 lines. It papers over three of the gaps — dependencies (the sentinel poll), retries (the attempt loop), and self-overlap (the flock) — and partially addresses silent failures: Slack notifications cover most errors but miss segfaults and OOM kills. Six things are still wrong with it.
- `flock` prevents a single job from overlapping itself, but two different jobs (orders and refunds) sharing a temp file still race, because the lock is per-job, not per-resource.
- The sentinel directory is a global shared filesystem — a sentinel from yesterday's run is still there at 02:00 today, so the polling loop returns instantly and runs against stale orders.
- The retry loop sees only success vs failure — a rate-limited 429 from an upstream API is retried with the same 60-second backoff as a transient TCP error, neither of which is the right strategy for a 429.
- The dependency encoding is hardcoded case matching — every new job requires editing this script and redeploying it.
- The log files accumulate forever in /var/log/etl until the disk fills.
- The Slack webhook has no concept of "alert ack" — every failure pings the channel, and the on-call engineer learns to mute it because half the alerts are transient and self-heal.
The wrapper does the work that cron itself does not. It is also where every minor regression in the team's pipeline lives. Each new requirement (longer dependency chain, per-resource locking, exponential backoff, log retention, alert deduplication) adds 10–30 more lines. By the time the team's eleventh job goes live, run.sh is 280 lines, three engineers have edited it, the test coverage is zero, and nobody is sure which version is currently deployed on which VM.
Why this script is the universal proof that a real scheduler is needed: every part of run.sh is something Airflow, Dagster, Prefect, or even a custom 200-line DAG executor (Build 4, chapter 21) does for you, declaratively, with a UI to inspect state. The case is not "Airflow is faster" or "Airflow has more features"; the case is that the team has already written a scheduler — badly, in bash, with no tests, no UI, and no central state. Switching from run.sh to a real scheduler is not an upgrade; it is replacing one scheduler with another that the team did not have to write themselves.
The cost of staying past the wall
Teams that don't cross the wall pay in a specific currency: incidents that happen at 3 a.m., are diagnosed at 5 a.m., and produce a Jira ticket at 9 a.m. that says "investigate". The Aditi-at-Bengaluru-D2C cost profile is roughly the median for an Indian product company with one or two data engineers and a 50-job cron stack:
| Symptom | Frequency | Person-hours per incident | Annual cost |
|---|---|---|---|
| Two jobs overlap on a shared file | once a quarter | 6 (debugging + bash patch) | 24 hr/yr |
| Upstream takes longer, downstream runs on stale data | once a month | 4 (Slack thread + backfill + dashboard fix) | 48 hr/yr |
| Cron job silently fails, dashboard wrong | twice a year | 16 (escalation + post-mortem + backfill) | 32 hr/yr |
| Schedule drift causes SLA breach | once a year | 24 (customer-facing escalation) | 24 hr/yr |
| run.sh regression (the wrapper itself broke) | twice a year | 12 (rolling back, blame, fixing) | 24 hr/yr |
The total is around 150 person-hours per year on the wrapper-and-cron stack — about 4 person-weeks of one engineer, or 8% of one engineer's time, lost to the gaps cron does not fill. That number is consistent with what Razorpay's data-platform team reported in their 2024 internal post-mortem on their pre-Airflow era: ~12% of one engineer's time spent on bash-scheduler maintenance before the migration; ~3% spent on Airflow maintenance after. The migration cost was about 2 person-months. The break-even is well under a year.
The number teams actually pay attention to is not the engineering hours but the dashboard credibility cost. Once the morning standup learns "the dashboard might be wrong, check with data team before believing it", the data team's leverage on every other initiative drops. The CFO who can no longer trust the GST reconciliation reads a manual spreadsheet from finance instead, which means the data team's pipeline isn't load-bearing for any decision the CFO cares about, which means the data team's headcount budget for next year is smaller. The compounding cost of a hand-rolled scheduler is not the on-call hours — it is the slow erosion of trust in the data the team produces.
A second-order observation: teams that cross the wall on time tend to be teams whose data engineer has worked at a previous company that used Airflow or Dagster. Teams whose data engineer started their career at the current company tend to underestimate the wall — they have written run.sh and it works, and they don't know what they are missing because they have never used the alternative. The wall is, partly, an experience-distribution problem. The fix is for senior engineers to insist on the scheduler at the third or fourth job, not at the eleventh.
Why "trust in the dashboard" matters more than the engineering hours: pipeline incidents have two costs — the engineering hours to fix them, and the credibility hit when stakeholders learn the data sometimes lies. The first is bounded (at worst, an extra engineer hired); the second compounds across every decision the company makes. Once a CFO has seen one wrong GST reconciliation, they shadow-check every reconciliation forever. Once the growth team has seen one wrong attribution number, they double-source every campaign report. The data team's leverage on every initiative drops, and that drop never recovers automatically — it has to be re-earned, which takes about three quarters of clean dashboards. The wall is fundamentally a trust-loss problem dressed up as an engineering-hours problem.
A war story: the Tuesday morning that broke the dashboard
Aditi's third major incident in twelve months — the one that finally won the budget for an Airflow migration — is worth walking through end to end. It is the median Indian D2C cron failure. If you recognise three steps of it from your own on-call history, you are past the wall and have not yet acknowledged it.
Monday 23:45 IST. A backend engineer deploys a fix to the orders service that adds a join to the customer-tier table on every order write. The deploy is correct; the fix takes the average write latency from 80ms to 140ms. Nobody on the data team is in the deploy chat.
Tuesday 02:00 IST. orders.sh fires from /etc/cron.d/etl. The extract from the orders OLTP, which historically completed in 22 minutes, now spends most of its time waiting on locks held by the slower writes, and the waits compound across the scan. The job will run for 85 minutes, finishing at 03:25.
Tuesday 02:30 IST. attribution.sh fires. Its branch of the case block in run.sh — patched two incidents ago to ignore any orders.done older than today's run — polls for a fresh sentinel. None arrives. The polling loop sleeps 60 seconds, retries, sleeps 60, retries — for 30 minutes. At 03:00 the loop times out (for i in $(seq 1 30)), the job exits with code 2, the Slack webhook fires.
Tuesday 02:45 IST. forecast.sh fires. Same polling loop, same 30-minute timeout, exits at 03:15 with the same Slack alert.
Tuesday 03:00 IST. gst.sh fires. Its branch of the case block checks only that orders.done exists — and it does: yesterday's sentinel is still sitting in /var/run/etl, because nothing clears sentinels at the start of a run. The polling loop returns instantly. gst happily reads the orders table — still mid-extract, so effectively yesterday's orders — joins it with today's inventory snapshot, and produces a GST report that is internally inconsistent (yesterday's orders against today's inventory).
Tuesday 03:14 IST. The on-call phone rings. Aditi opens Slack. There are 23 alerts: two are the failed sentinel polls, two are flock "previous run still active" warnings, and nineteen are the routine "ok in 412s" messages from successful jobs that the team has learned to ignore. The two failed-sentinel alerts are buried.
Tuesday 04:30 IST. Aditi finds the failed jobs, reruns them manually. The attribution numbers are now correct. The GST report is wrong but nobody has looked at it yet.
Tuesday 09:00 IST. Standup. The CFO asks why the GST reconciliation says ₹2.4 crore yesterday and ₹1.8 crore today when daily order volume is roughly flat. Aditi spends the rest of the morning rebuilding yesterday's GST report from raw orders. The report is delivered at 14:30. Trust in the GST report drops permanently — the CFO instructs the finance team to cross-check it manually for the next quarter.
Tuesday 16:00 IST. Aditi files a ticket: "migrate scheduling to Airflow". It is the third such ticket in twelve months. This time the CFO co-signs it. Budget is approved by Friday.
Wednesday 11:00 IST. Aditi writes a post-mortem. The five contributing factors are: slower upstream extract, polling-loop timeout sized for normal-day variance, sentinel files not cleared at the start of each run, alert fatigue burying real failures, and no UI to surface DAG state. Each factor is fixable in run.sh with another 15–30 lines. None of them are fixable simultaneously without rewriting run.sh from scratch. The post-mortem's conclusion is one sentence: "the cost of patching the wrapper exceeds the cost of replacing it".
Thursday 09:00 IST. The post-mortem is read in the engineering all-hands. The CFO asks the head of engineering whether other teams have similar fragility. The answer is yes — the marketing pipeline, the inventory pipeline, and the recommendation pipeline all run on similar bash wrappers, and all are within one or two incidents of their own Tuesday morning. The migration scope expands from one team to the whole data platform. The Airflow rollout is staffed at one engineer full-time and one part-time for the next quarter.
The pattern in this incident is not "cron is broken" — every individual component worked as designed. The pattern is that the gaps between the components compounded. The slower orders extract was a 63-minute delay; the polling loop's 30-minute timeout was sized for normal-day variance; the sentinel-not-cleared bug was a separate logic error in run.sh; the alert fatigue was a culture problem; the GST job consuming a stale sentinel was a data-correctness bug that none of the other failures pointed to. The cost was a morning of finance-team hours rebuilding the report and one quarter of CFO trust.
A real scheduler would have prevented every step of this. Airflow's scheduler tracks task state in a database, not in sentinel files; the gst task would not start until the orders task of today's run completed, rather than whenever some orders.done file happened to exist. Retries are configured per-task with exponential backoff; alerts are deduplicated and routed by severity, not fired one per job; the UI shows the state of every DAG at a glance instead of burying it in 23 Slack messages. Building all of that in bash is possible. Building it correctly in bash, with tests, in a way that survives three engineers editing the same script, is not.
The fix the next build delivers
The response to this wall is Build 4 — chapters 19 through about 32 — which builds a real scheduler from scratch in 200 lines of Python before introducing Airflow, Dagster, and Prefect. The build's argument is that a scheduler is a finite-state machine over a DAG, with retries, sensors, and SLA tracking layered on top, and that understanding the FSM is more important than memorising any one tool's UI.
Concretely, the next four chapters do the following. Chapter 19 (cron's three flaws) decomposes the gaps this chapter listed into three structural failures of cron: time-based scheduling (instead of dependency-based), no retry semantics, no observability. Chapter 20 (the DAG abstraction) introduces directed acyclic graphs as the right primitive — the dependency graph the operator was holding in their head becomes the data structure the scheduler operates on. Chapter 21 (DAG executor in 200 lines) builds a working executor in Python that parses a YAML DAG, runs tasks in topological order, retries on failure, and writes state to a SQLite file. The 200 lines do everything run.sh does and more. Chapter 22 (task dependencies) generalises wait-for to fan-out, fan-in, and conditional branches.
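As a taste of what chapter 21 builds — this is an illustrative sketch, not the book's actual 200 lines — the executor's core loop is small: topological order from the dependency map (Python's stdlib `graphlib` provides it), a bounded retry per task, and an explicit state for every task, including the ones that never started.

```python
from graphlib import TopologicalSorter

def execute(dag, tasks, retries=2):
    """dag: {task: set_of_upstream_tasks}; tasks: {name: callable}.
    Run every task in topological order, retrying failures, and record
    a state for every task — including the ones that never started."""
    state = {}
    for name in TopologicalSorter(dag).static_order():
        if any(state.get(up) != "success" for up in dag.get(name, ())):
            state[name] = "upstream_failed"   # skipped, not silently absent
            continue
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                state[name] = "success"
                break
            except Exception:
                state[name] = "failed"        # retried until attempts run out
    return state

# Aditi's fan-out, as a dependency map instead of five cron times:
dag = {"attribution": {"orders"}, "forecast": {"orders"},
       "gst": {"orders"}, "refunds": {"orders"}}
```

If orders fails all its attempts, the four downstream tasks end in `upstream_failed` rather than running against stale data — exactly the behaviour the sentinel files could not deliver.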
By chapter 23 (retries and timeouts) the executor is starting to feel like Airflow's scheduler. By chapter 26 (SLA tracking) it has features Aditi's run.sh would never grow. By chapter 30 the build pivots to "now compare with Airflow" — and the reader who built the executor reads the Airflow source with comprehension, not awe.
The wall is the motivation for the whole build. Every chapter from 19 to 32 answers a specific gap that this chapter identified. The reader who finishes Build 4 has the conceptual model to evaluate any scheduler — Airflow, Dagster, Prefect, Kestra, Mage, the next one — by mapping its features to the FSM-over-DAG primitive and asking which gaps it closes.
A subtle point about the build's pedagogical order: it builds the scheduler before introducing the production tools, not after. The reverse order is more common — most data-engineering courses start with "here is Airflow, here is a DAG" and the reader copies the syntax without understanding why the FSM exists. Build 4's order is deliberate: build the thing, feel why each feature is necessary, then read Airflow's source and recognise every component. The reader who learns Airflow first and the FSM second carries a vocabulary of brand names; the reader who learns the FSM first and Airflow second carries a vocabulary of mechanisms. The mechanism vocabulary transfers across tools (the same FSM exists in Dagster, in Prefect, in every batch scheduler ever built); the brand vocabulary does not. When the reader's company decides to migrate from Airflow to Dagster three years later — or from Dagster to a homegrown thing — the mechanism vocabulary is what makes the migration tractable.
The build has one further argument that this chapter has been hinting at: the FSM-over-DAG primitive is the same primitive that streaming systems (Build 7+), CI/CD systems (Jenkins, GitHub Actions), workflow engines (Temporal, Cadence), and even some compiler IRs use. The reader who internalises the primitive in Build 4 carries it forward into the streaming chapters where it reappears as Kafka Streams' processing topology, into the lakehouse chapters where it reappears as Iceberg's manifest tree, and into the production-engineering chapters where it reappears as Temporal's saga pattern. The 200 lines of Python in chapter 21 are the most leveraged 200 lines in the curriculum; the cost of that leverage is reading them carefully once.
The signals that you are at the wall
The reader who has not yet hit the wall but is wondering whether they are about to should look for these specific signals. Any three of them mean you are within a quarter of the dramatic incident; any five mean it has already happened and you have not noticed.
- The wrapper script you call from `crontab -l` is more than 50 lines.
- You have at least one comment in `run.sh` that says `# TODO: figure out why this fails`.
- The on-call rotation has a Slack channel called `#etl-noise` or `#data-alerts` where the median message is muted by half the team.
- At least one cron entry has the comment `# wait at least N min between this and the previous one`.
- A new pipeline takes longer to set up than the actual SQL of the pipeline (because the wrapper needs to be customised).
- The newest engineer cannot answer "what runs at 02:30?" without grepping the cron file.
- A schedule change requires editing 4+ cron entries to maintain the gap invariant.
- You have written your own log-rotation cron entry because `/var/log/etl` filled up the disk last quarter.
- The dependency graph "lives in everyone's head" — there is no single document that lists which job needs which.
- A pipeline failure was diagnosed as "schedule drift" within the last six months.
- You once said "the cron job ran but didn't do anything" and meant it literally.
The signals are deliberately concrete. The abstract version is "your operational complexity has outgrown your tooling", but engineers find that hard to act on. The concrete signals are checkable in 10 minutes by reading your own cron file and Slack history. If the count is high, the next sprint should start chapter 21 of this curriculum, not a new ETL pipeline.
Common confusions
- "Cron is the problem and Airflow is the answer." Cron is the right answer for a small number of independent jobs; Airflow is the right answer for many interdependent jobs. The wall is between them. A team running two cron entries that don't talk to each other should not migrate to Airflow — the operational overhead of running an Airflow cluster (PostgreSQL backend, scheduler process, web UI, executor pool) outweighs the value at that scale. The migration is a function of job count and dependency depth, not of cron being inherently bad.
- "
run.shis a scheduler I built; it works, so I don't need Airflow." The script is a scheduler, but the comparison is not "is it a scheduler?" but "does it have the properties a scheduler needs at this scale?". Past the wall the answer is no — it lacks central state, a UI, structured retries, alert deduplication, and tests. The argument for switching is not thatrun.shis bad; it is that maintainingrun.shpast 5 jobs costs more than running Airflow. - "The dependency graph is just a wait-for between jobs." Real DAGs include fan-out (one parent, many children), fan-in (many parents, one child), conditional branches (run B only if A produced rows), and dynamic DAGs (the set of children is computed from data). A pure wait-for chain is the simplest case; production graphs hit all four patterns within the first ten jobs.
- "
flocksolves overlap, so locks aren't a problem."flocksolves single-job overlap (job at 02:00 still running when next 02:00 fires). It does not solve multi-job resource contention (orders.sh and refunds.sh both writing to the same temp file). Multi-job locking requires per-resource locks, which require knowing which resources each job touches — which is exactly what a DAG executor's task graph encodes andflockdoes not. - "Cron timing drift is small enough to ignore." A single drift of 30 seconds is small. Compounded across 47 cron entries over 12 months it produces overlapping schedules where the operator's mental model says "30 minutes apart" and the actual gap is 8 minutes. The cost of the drift is not the drift itself; it is that the operator's belief about the schedule diverges from reality, and the divergence is invisible until the incident fires.
- "I'll just write a more careful
run.sh." Every bash-scheduler the author has seen in 12 years of consulting at Indian product companies has converged toward the same shape — a 300–600 line script with embedded YAML, aneval-based templating layer, and a SQLite database for state. At that point the team has reinvented Airflow's scheduler module in shell, with 0% of its test coverage. The right move is to use the tool that has the test coverage, not to keep growing the wrapper.
Going deeper
Why "5 jobs" specifically — the dependency-depth heuristic
The number is empirical, not derived. Surveys of data-engineering teams (Locally Optimistic 2023, dbt Slack pulse 2024, Airbyte 2024 State of Data Engineering) put the median migration point at 6–8 cron entries with at least one dependency chain of depth 3. Teams with all-independent jobs tolerate 20+ cron entries before migrating; teams with deep dependency chains migrate at 4. The rough rule is: when the deepest dependency chain reaches length 3, build a DAG. A chain of depth 3 (A → B → C) means B's runtime variance compounds across two scheduling windows; the operator can no longer pick a fixed wall-clock gap that works in all cases.
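The depth-3 heuristic is mechanically checkable against the dependency map a team already holds in its head. A sketch (the example graph is illustrative):

```python
from functools import lru_cache

def deepest_chain(dag):
    """dag: {job: set_of_upstream_jobs}. Length of the longest chain."""
    nodes = set(dag) | {u for ups in dag.values() for u in ups}

    @lru_cache(maxsize=None)
    def depth(job):
        # 1 for this job, plus the deepest chain feeding into it.
        return 1 + max((depth(u) for u in dag.get(job, ())), default=0)

    return max(depth(j) for j in nodes)

# orders -> attribution -> weekly_report: a chain of depth 3.
dag = {"attribution": {"orders"}, "weekly_report": {"attribution"}}
print(deepest_chain(dag))   # -> 3
```

When this returns 3 or more, the rule of thumb above says it is time to build the DAG.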
The bash-scheduler convergence and the pre-Airflow era at Indian unicorns
A pattern noticed across hundreds of company-internal post-mortems: bash schedulers converge on a near-identical architecture as they grow. Lock with flock, sentinel files for dependencies, retry loop with exponential-ish backoff, Slack/PagerDuty webhook for alerts, log rotation via cron, state in a per-job text file or SQLite. The convergence is not because one team copied another — it is because the constraints of bash + cron force the same set of solutions. The shape of the script you would write today is the shape of the script Razorpay's first data engineer wrote in 2018, that Swiggy's wrote in 2017, that every Y Combinator startup's first data engineer wrote in 2020. The fact that all bash schedulers converge to the same shape is itself the argument that the shape is real, namable, and worth abstracting — which is exactly what Airflow's scheduler does. Razorpay, Swiggy, Cred, Zomato, and Meesho all had a bash-scheduler era; the migrations happened between 2018 and 2021, mostly to Airflow, with one notable Dagster migration (Cred, 2022). The internal post-mortems converge on three lessons: the migration was easier than expected, because the bash scheduler had already encoded the DAG (in case statements and sentinel files) and translating it into Airflow DAG definitions was mechanical; the most painful part was data-quality issues that were masked by silent failures and surfaced loudly once Airflow's UI made every failure visible; and teams that delayed the migration past 100 cron entries spent 3× the effort of teams that migrated at 30. Don't wait for the wall to be obvious — by then it has been costing you for a year.
The "operator-defined SLA" pattern that cron cannot express
A real scheduler lets the operator declare: "the orders pipeline must complete by 06:00 IST every day, and if it does not, page the on-call". That declaration is structured — it is a constraint over the DAG's terminal node, not a statement about any individual task. Cron cannot express it because cron has no notion of "the DAG completed". The closest workaround is a separate cron entry at 06:00 that checks for a sentinel file and pages if absent — which works, but is yet another 30-line wrapper that has to be kept in sync with every change to the DAG's structure. The SLA-as-DAG-property pattern is one of the cleanest illustrations of why the DAG primitive is correct: a property over the graph (terminal node completed by time T) cannot be encoded outside the graph without redundancy.
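A sketch of SLA-as-a-graph-property — all names invented, no tool's API — shows why the constraint naturally attaches to the DAG's terminal nodes rather than to any one task: restructure the interior of the graph and the declaration does not change.

```python
from datetime import datetime

def terminal_nodes(dag):
    """dag: {task: set_of_upstreams}. Tasks that nothing depends on."""
    upstreams = {u for ups in dag.values() for u in ups}
    return set(dag) - upstreams

def sla_breached(dag, completed_at, deadline):
    """completed_at: {task: datetime or None}. Breach if any terminal
    task is unfinished at the deadline, or finished after it."""
    return any(
        completed_at.get(t) is None or completed_at[t] > deadline
        for t in terminal_nodes(dag)
    )

dag = {"attribution": {"orders"}, "gst": {"orders"}}
done = {"orders": datetime(2026, 2, 3, 2, 41),
        "attribution": datetime(2026, 2, 3, 3, 10),
        "gst": None}                                        # never finished
print(sla_breached(dag, done, datetime(2026, 2, 3, 6, 0)))  # -> True
```

The cron workaround — a 06:00 entry that checks a sentinel — re-encodes `terminal_nodes` by hand, and silently rots every time the graph changes.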
Migration mechanics: managed vs self-hosted, single-task vs multi-task DAGs
Once the team has decided to migrate, two practical questions follow. First, managed or self-hosted? Indian-market options in 2026 are AWS MWAA (₹35–60k/month), GCP Cloud Composer (₹45–80k/month), Astronomer Cloud (₹17k+/month), or self-hosted on a small EC2/GCE instance (₹4–8k/month plus engineering time). For a team that has just crossed the wall, start managed — the marginal cost over a self-hosted box is small compared to the engineering cost of debugging a scheduler-process crash at 2 a.m. while the team is still learning Airflow. The arithmetic typically flips around 20+ DAGs and 200+ tasks/day, but only if the team has dedicated DevOps capacity.

Second, single-task or multi-task DAGs? A common mistake is translating every cron entry into a separate DAG, producing 47 single-task DAGs that share dependencies via ExternalTaskSensor references — which is the sentinel-file polling pattern in a different syntax. The right translation collapses dependency-linked cron entries into single multi-task DAGs: orders → (attribution, forecast, gst, refunds) becomes one five-task DAG. The signal that the translation has gone wrong is ExternalTaskSensor calls scattered through the codebase; the right pattern uses native task dependencies (a >> [b, c, d]).

A specific Indian-context hazard breaks bash schedulers earlier than the "5 jobs" rule of thumb: the Tally GST exporter, the Razorpay-account-statement CSV downloader, the Zerodha contract-note ingest. These are jobs owned by the finance team but living in the data team's /etc/cron.d/etl. Most Indian companies have at least one, and for them the wall arrives at 3 jobs instead of 5 because the cross-team coordination cost is built in.
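What "native task dependencies" buys can be shown without Airflow installed. The toy class below is not Airflow code, but it implements `>>` the same way Airflow's operators do (via `__rshift__`), so the five-task translation reads identically:

```python
class Task:
    """A toy stand-in for an Airflow operator: just enough to show how
    `a >> [b, c, d]` records fan-out as graph structure inside one DAG,
    rather than as sentinel-file polling between separate DAGs."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # Accept a single task or a list, mirroring Airflow's fan-out syntax.
        targets = other if isinstance(other, list) else [other]
        for task in targets:
            self.downstream.append(task)
        return other

# Aditi's five-task DAG: one root, four dependents — one DAG, not five.
orders = Task("orders")
attribution, forecast, gst, refunds = (
    Task(n) for n in ("attribution", "forecast", "gst", "refunds"))
orders >> [attribution, forecast, gst, refunds]
```

The dependency lives in one place, next to the tasks it connects — delete a task and the edge goes with it, which is exactly what the sentinel-file version cannot guarantee.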
Worked example: Aditi's migration ROI
Aditi's team eventually migrated. The numbers from her post-migration retrospective:
- Before — 47 cron entries, ~150 person-hours/year on run.sh maintenance, 1 SLA breach per year (₹40 lakh in customer-credit cost the year it happened), 3 a.m. pages roughly twice a month.
- Migration effort — 6 person-weeks (Aditi plus one junior engineer, working part-time over two months) to set up Airflow on a managed instance (AWS MWAA, ₹40k/month), translate the cron entries into DAGs, migrate run.sh call sites to BashOperator, and run the two systems in parallel for two weeks before cutting over.
- After — 12 DAGs (each with 3–6 tasks), ~25 person-hours/year on Airflow maintenance, zero SLA breaches in 18 months, 3 a.m. pages roughly once a quarter.
- ROI — payback in about 8 months on engineering time alone, and that ignores the ₹40 lakh in customer-credit costs that the absence of SLA breaches avoided.

The number that mattered most to Aditi's CFO was the ₹40 lakh that stopped being at risk: the data team became trustworthy again, and the morning standup stopped doubting the dashboard.
Where this leads next
- cron: the simplest scheduler and its three flaws — chapter 19, the systematic decomposition of why cron breaks
- The DAG as the right abstraction — chapter 20, the data structure that replaces the cron file
- Writing a DAG executor in 200 lines — chapter 21, building the scheduler the reader could not stop themselves from building in bash
- Task dependencies: wait-for, fan-out, fan-in — chapter 22, generalising the dependency graph beyond chains
- Retries, timeouts, and poisoned tasks — chapter 23, the retry semantics run.sh got wrong
Build 4 is the response to this wall. By chapter 32 the reader has a 200-line scheduler that does what run.sh does, plus everything run.sh does not, plus a UI, plus tests. The reader who finishes the build can read Airflow's source code with comprehension and decide for themselves which scheduler tool to adopt — Airflow, Dagster, Prefect, Kestra — based on which gaps each tool closes for their team's specific load profile.
References
- Airflow scheduler architecture — the canonical reference for how a real scheduler decomposes the FSM over the DAG.
- Dagster's "Software-defined assets" introduction — the alternative framing where the asset, not the task, is the primitive.
- The dbt Slack 2024 pulse on scheduler tooling — community survey of who uses what, with the migration-from-cron data point.
- Locally Optimistic: when to graduate from cron — the canonical post on the wall, written by ex-Stitch Fix data engineers.
- Razorpay engineering blog: data platform 2018–2024 — public-facing post-mortems including the bash-scheduler era.
- The flock man page — the lock primitive every bash scheduler converges toward.
- Schema drift: when the source changes under you — the previous chapter, the column-axis sibling of this chapter's schedule-axis problem.
- Crontab.guru and crontab(5) — the cron syntax reference, useful both as the floor and as the limit of what the syntax can express.
The honest summary: cron is not bad. Cron is the right answer for a small number of independent jobs and stays the right answer for many years if the jobs do not grow dependencies. The wall is not "cron is bad"; it is "cron has done its job and the workload has outgrown it". The signal that the wall has been reached is not a single dramatic incident — it is the observation that the operator now spends more time on run.sh than on the actual ETL logic, and that every new requirement adds wrapper code instead of business logic. Build 4 is the response: a scheduler that closes the gaps cron leaves, built in 200 lines so the reader understands every line, before adopting Airflow or Dagster or Prefect with their eyes open.
A practical exercise: count the cron entries on your most production-critical pipeline VM. Count the lines of bash in the wrapper script(s) that those entries call. If the wrapper is more than 100 lines, you are already past the wall and have not yet noticed. Spend the next sprint building chapter 21's 200-line DAG executor — not to replace your stack today, but to feel the difference in your hands. The migration to a real scheduler will follow naturally once you have the conceptual map; without the map, the migration always feels like adopting somebody else's complexity instead of replacing your own.
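To make the exercise concrete, here is a seed of that executor — a sketch under the assumption that tasks are plain callables and the graph is a dict; the full build layers retries, state, and a UI on top of the same loop:

```python
from collections import deque

def run_dag(deps, run):
    """Kahn's-algorithm execution: `deps` maps each task name to the
    upstream tasks it waits for; `run` executes one task. A task becomes
    ready only when every upstream has finished — the guarantee no
    cron time-offset provides. No retries, persistence, or parallelism."""
    waiting = {task: set(ups) for task, ups in deps.items()}
    ready = deque(task for task, ups in waiting.items() if not ups)
    finished = []
    while ready:
        task = ready.popleft()
        run(task)
        finished.append(task)
        for downstream, ups in waiting.items():
            if task in ups:
                ups.discard(task)
                if not ups and downstream not in finished and downstream not in ready:
                    ready.append(downstream)
    if len(finished) != len(deps):
        raise RuntimeError("cycle or missing upstream in DAG")
    return finished

# Aditi's graph, with gst also waiting on refunds to show fan-in.
order = run_dag(
    {"orders": [], "attribution": ["orders"], "forecast": ["orders"],
     "refunds": ["orders"], "gst": ["orders", "refunds"]},
    run=lambda name: None)
```

Twenty lines already do what no arrangement of cron offsets can: orders always runs first, and gst waits for refunds no matter how long refunds takes on a given morning.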
A second exercise, harder but more honest: pick the most recent 3 a.m. page in your team's history. Write the post-mortem in a specific shape — what was the schedule cron believed it was running? What was the dependency graph the operator had in their head? Where did the two diverge? In nine out of ten cases the post-mortem will read like the Tuesday-morning narrative above, with the names changed. The fact that a generic post-mortem template fits is the strongest empirical signal that the wall is real. The shape of the bug is not idiosyncratic to your team's stack; it is an emergent property of using cron past the point where cron's interface stops matching your workload.
The longer arc, after the team migrates and lives with a real scheduler for a year, is that the conversation about pipelines shifts. The team stops debating "is the schedule right?" and starts debating "is the SLA achievable given our retry budget?". The questions become structural — about graph topology, parallelism limits, retry semantics, alert routing — rather than about wall-clock minutes. That shift is what graduating from cron actually buys you: not faster pipelines, but the ability to think about pipelines at the right level of abstraction. The 200-line DAG executor in chapter 21 is the lever that makes that shift possible. Cron made the schedule the conversation; the DAG makes the dependencies the conversation; that is the only conversation that scales past the eleventh job.