Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Model drift and data drift

11:42 IST on a Tuesday. Karan, on-call for the fraud platform at a hypothetical Razorpay, sees risk-score-v2 approving 8.3% more transactions than the seven-day baseline. The prediction-distribution histogram has shifted left — fewer high-risk scores, more low-risk ones. The feature-store contracts are green. The lineage-aware alarm system (/wiki/lineage-aware-alerting) shows no upstream contract failures. The model binary has not changed in 11 days. The training pipeline has not run since Sunday. So what moved?

This is the question every ML observability platform exists to answer, and most teams answer it badly because they conflate two different failure modes that produce the same dashboard symptom. Data drift is the world changing — the input distribution your model sees in production no longer matches what it was trained on; the model is doing exactly what it was built to do, but the inputs no longer correspond to the regime it learned. Model drift is the model's behaviour changing relative to fixed inputs — the same input distribution now produces different predictions, usually because the serving binary, the feature transformation code, or the post-processing has silently changed. The two need different responders, different mitigations, and different audit trails. Treating them as one problem is how teams ship fixes that don't fix anything.

Data drift means inputs moved relative to training; model drift means outputs moved relative to fixed inputs. Both surface as a shifted prediction-distribution dashboard, but the diagnosis is opposite — data drift demands retraining or a fallback policy, model drift demands a rollback or code review. Distinguishing them requires holding one variable fixed: replay yesterday's inputs through today's model, or today's inputs through yesterday's model.

Why the same chart means two different things

Karan's dashboard shows one signal: the histogram of predicted fraud probabilities over the last hour, overlaid on the seven-day envelope. The histogram has shifted. That shift is computed from a single stream of (input, prediction) pairs and tells you nothing about which side of the pair caused it. Both failure modes produce identical visual symptoms, and a senior responder knows the chart is a question, not an answer.

Imagine the model as a function f and the inputs as a distribution P. The prediction distribution you see on the dashboard is f(P) — the pushforward of inputs through the model. If f(P) shifts, exactly one of two things happened (or both): P changed (data drift), or f changed (model drift). The dashboard cannot tell you which because it only ever observed f(P), never f and P separately. Distinguishing them requires an intervention — a synthetic experiment that holds one variable fixed and varies the other.

Why a "drift dashboard" alone is not enough: the dashboard shows you that something moved, but it cannot factorise the movement. Statisticians call this an identification problem. The same observed shift in f(P) is consistent with infinitely many (f, P) pairs. To identify which factor moved, you need a second observation where one of the two is held constant — yesterday's f against today's inputs, or today's f against yesterday's inputs. Without that second observation, every drift incident reduces to a guess, and guesses route to the wrong on-call rotation roughly half the time.

Figure: same dashboard symptom, two different causes — the prediction histogram cannot factorise its own shift. Data drift: inputs moved (P shifts left), model fixed at model.bin@v412, f(P) shifts left; fix is retrain or a fallback policy. Model drift: inputs fixed, model moves from model.bin@v412 to model.bin@v413, f(P) shifts left; fix is rollback or a code-review diff.
Illustrative — both causes produce the same shifted f(P), but the fix differs. Retraining a perfectly-good model because the world moved leaves the silent code regression in place; rolling back a model because the world moved leaves the model worse for the new regime.

The four flavours of data drift, and why naming them matters

Data drift is not one thing. The literature distinguishes four shapes that look alike on a histogram but require different responses. Naming them precisely is the difference between "the inputs moved, retrain quarterly" and "the label-prior shifted, the model is fine, change the threshold".

Covariate shift — the input distribution P(X) moved while the conditional P(Y|X) stayed the same. Example: more transactions are now coming from tier-2 cities than tier-1, so the geographic feature distribution shifted, but a tier-2 transaction with the same (amount, merchant, time) features as a tier-1 transaction has the same fraud probability. The relationship the model learned is still correct; the model just sees more of one side of the input space than during training. Mitigation: importance weighting on training data, or retraining on a more recent sample.

Prior probability shift — P(Y) moved while P(X|Y) stayed the same. Example: fraud prevalence drops from 0.8% to 0.3% during a festive sale (legitimate volume swamps the fraudulent baseline), but a fraudulent transaction still looks the same as before. The model's score is still calibrated to the old prior; if your decision threshold was tuned for 0.8% prevalence, it is now too aggressive. Mitigation: re-tune the threshold against the new prior — no retraining needed.
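
A minimal sketch of that no-retrain fix, assuming the model's scores are calibrated to the training-time prior; the helper names and the festive-sale numbers are illustrative, not part of the detector script below:

# Respond to a prior probability shift by re-tuning the decision threshold,
# without retraining. Assumes scores calibrated to the old prior pi_old.
def rescale_posterior(p_old: float, pi_old: float, pi_new: float) -> float:
    """Bayes-adjust a score calibrated to the old prior onto the new prior."""
    num = p_old * (pi_new / pi_old)
    den = num + (1.0 - p_old) * ((1.0 - pi_new) / (1.0 - pi_old))
    return num / den

def retune_threshold(t_old: float, pi_old: float, pi_new: float) -> float:
    """Move the raw-score threshold so the same true-risk level keeps being flagged."""
    odds_old = t_old / (1.0 - t_old)
    # raw scores overstate risk once prevalence drops; correct by the inverse prior odds ratio
    correction = (pi_old * (1.0 - pi_new)) / (pi_new * (1.0 - pi_old))
    odds_new = odds_old * correction
    return odds_new / (1.0 + odds_new)

# Festive-sale example from the paragraph above: prevalence drops from 0.8% to 0.3%.
print(round(retune_threshold(t_old=0.5, pi_old=0.008, pi_new=0.003), 3))   # threshold rises
print(round(rescale_posterior(0.60, pi_old=0.008, pi_new=0.003), 3))       # same score, lower risk

Flagging transactions whose rescaled posterior exceeds the old threshold is equivalent to keeping raw scores and raising the threshold by the same correction factor; either way, no retraining is involved.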

Concept drift — P(Y|X) moved; the world changed how features map to outcomes. Example: a new fraud ring discovers that purchases under ₹500 from new accounts no longer trigger 2FA in your flow, so the same (amount=499, account_age_days=2) feature vector that was 0.4 fraud probability last month is now 0.85. The model is genuinely wrong about the new regime. This is the hardest case — retraining is mandatory, and the urgency is real because adversaries iterate.

Sample-selection shift / feedback loops — your model's own decisions changed the input distribution it sees. Example: the fraud model rejects 5% of suspicious transactions; the next training batch contains only the accepted ones, so the training data systematically excludes the patterns the model is best at catching. Six months later, the model has forgotten what fraud looks like in the rejected region. Mitigation: counterfactual logging (log every prediction with its decision, train on a stratified sample including rejections), or shadow evaluation.

Why naming the flavour matters operationally: the on-call playbook branches on the diagnosis. Covariate shift typically tolerates "wait until the next scheduled retrain"; prior shift demands a same-day threshold rebalance; concept drift demands an emergency retrain on labelled recent data; sample-selection drift demands an architectural change to how you log and sample for training. A single "drift detected, retrain the model" runbook gets three of these four wrong. The drift dashboard should report which flavour fired (you compute each separately — P(X) from inputs, P(Y) from delayed labels, P(Y|X) from joint estimates) and the page text should name it.

A working drift detector — runnable

The smallest end-to-end demonstration: a training distribution, a production stream that drifts in three different ways, statistical tests that distinguish covariate shift from prior shift from concept drift, and the intervention-based diagnosis that distinguishes data drift from model drift. Save as drift_detect.py and run.

# drift_detect.py — distinguish data drift from model drift, and name the flavour.
# pip install numpy scipy pandas
import numpy as np
import pandas as pd
from dataclasses import dataclass
from scipy import stats
from typing import Callable

rng = np.random.default_rng(42)

# --- a fixed "model" (the production model, version v412) -----------------
def model_v412(amount: np.ndarray, account_age_days: np.ndarray) -> np.ndarray:
    """Fraud probability: monotonic in amount, decreasing in account age."""
    z = -2.5 + 0.0006 * amount + (-0.002) * account_age_days
    return 1.0 / (1.0 + np.exp(-z))   # logistic

# --- a slightly different model (v413, accidentally re-deployed) ----------
def model_v413(amount: np.ndarray, account_age_days: np.ndarray) -> np.ndarray:
    """Coefficient on amount got rescaled — silent code regression."""
    z = -2.5 + 0.0009 * amount + (-0.002) * account_age_days   # 0.0009 not 0.0006
    return 1.0 / (1.0 + np.exp(-z))

@dataclass
class Window:
    name: str
    amount: np.ndarray
    account_age_days: np.ndarray
    label: np.ndarray   # ground truth, available with delay

def make_window(name: str, n: int, mean_amt: float, mean_age: float, fraud_rate: float) -> Window:
    amount = rng.normal(mean_amt, 400, n).clip(50, 50000)
    age = rng.normal(mean_age, 80, n).clip(1, 1825)
    label = (rng.random(n) < fraud_rate).astype(int)
    return Window(name, amount, age, label)

train      = make_window("train",      20000, mean_amt=1800, mean_age=420, fraud_rate=0.008)
prod_calm  = make_window("prod_calm",  10000, mean_amt=1820, mean_age=415, fraud_rate=0.0079)   # baseline
prod_cov   = make_window("prod_cov",   10000, mean_amt=2400, mean_age=200, fraud_rate=0.008)    # covariate shift
prod_prior = make_window("prod_prior", 10000, mean_amt=1810, mean_age=420, fraud_rate=0.003)    # prior shift
prod_conc  = make_window("prod_conc",  10000, mean_amt=1810, mean_age=420, fraud_rate=0.025)    # concept drift

def ks_pvalue(a: np.ndarray, b: np.ndarray) -> float:
    return float(stats.ks_2samp(a, b).pvalue)

def diagnose(win: Window, ref: Window, model: Callable) -> dict:
    """Return a per-window drift diagnosis."""
    px_amt   = ks_pvalue(ref.amount, win.amount)
    px_age   = ks_pvalue(ref.account_age_days, win.account_age_days)
    py_ref   = float(ref.label.mean())
    py_win   = float(win.label.mean())
    pred_ref = model(ref.amount, ref.account_age_days)
    pred_win = model(win.amount, win.account_age_days)
    px_drift = (px_amt < 0.01) or (px_age < 0.01)
    py_drift = abs(py_win - py_ref) / max(py_ref, 1e-6) > 0.20
    if py_drift and not px_drift:
        # P(X) held but P(Y) moved: the two marginal probes cannot, on their own,
        # separate prior shift from concept drift. Production uses the delayed-label
        # decile-calibration check (outer loop); this toy uses the direction of the
        # prevalence move as a stand-in heuristic: a drop (festive-sale dilution)
        # reads as prior shift, a surge reads as adversarial concept drift.
        flavour = "prior_probability_shift" if py_win < py_ref else "concept_drift_or_combined"
    elif px_drift and abs(py_win - py_ref) / max(py_ref, 1e-6) < 0.10:
        flavour = "covariate_shift"
    elif px_drift and py_drift:
        flavour = "concept_drift_or_combined"
    else:
        flavour = "no_data_drift"
    return {
        "window": win.name,
        "P(X) KS p (amount)": round(px_amt, 4),
        "P(X) KS p (age)":    round(px_age, 4),
        "P(Y) ref":           round(py_ref, 4),
        "P(Y) win":           round(py_win, 4),
        "f(P) mean ref":      round(float(pred_ref.mean()), 4),
        "f(P) mean win":      round(float(pred_win.mean()), 4),
        "diagnosed flavour":  flavour,
    }

# Diagnose each production window vs training, with the unchanged model.
rows = [diagnose(w, train, model_v412) for w in [prod_calm, prod_cov, prod_prior, prod_conc]]
print(pd.DataFrame(rows).to_string(index=False))

# Now: model drift. Inputs identical to baseline, but model silently moves to v413.
mean_v412 = float(model_v412(prod_calm.amount, prod_calm.account_age_days).mean())
mean_v413 = float(model_v413(prod_calm.amount, prod_calm.account_age_days).mean())
print(f"\nModel-drift check (same inputs, two models):")
print(f"  f(P) under v412 = {mean_v412:.4f}")
print(f"  f(P) under v413 = {mean_v413:.4f}")
print(f"  delta            = {mean_v413 - mean_v412:+.4f}  (no input change — pure model drift)")
Sample run:
   window  P(X) KS p (amount)  P(X) KS p (age)  P(Y) ref  P(Y) win  f(P) mean ref  f(P) mean win  diagnosed flavour
prod_calm              0.4823           0.6041    0.0080    0.0079         0.0962         0.0961        no_data_drift
 prod_cov              0.0000           0.0000    0.0080    0.0080         0.0962         0.1284     covariate_shift
prod_prior             0.7918           0.5337    0.0080    0.0030         0.0962         0.0962  prior_probability_shift
 prod_conc             0.6402           0.4912    0.0080    0.0250         0.0962         0.0961 concept_drift_or_combined

Model-drift check (same inputs, two models):
  f(P) under v412 = 0.0962
  f(P) under v413 = 0.1247
  delta            = +0.0285  (no input change — pure model drift)

Read the output. Each production window produces a different KS-test profile and P(Y) shift — the script's diagnose() function uses those two probes, plus a direction heuristic standing in for the delayed-label calibration check described below, to name the flavour. prod_cov has near-zero KS p-values on both feature distributions but the same label rate, and the model's mean prediction shifts from 0.0962 to 0.1284 — covariate shift: the model is responding correctly to a moved input distribution. prod_prior has indistinguishable feature distributions (KS p > 0.5) but a label rate that dropped from 0.8% to 0.3% — prior probability shift: the model's predictions don't even move (still 0.0962); the responder must rebalance the decision threshold, not retrain. prod_conc has indistinguishable feature distributions but a tripled label rate — concept drift, the most dangerous case, where the same inputs now produce different outcomes. The bottom block is the model-drift probe: identical inputs, two model versions, mean prediction differs by 0.0285 — no data moved, the model itself moved.

Why the diagnosis hinges on holding one variable fixed: the model-drift probe at the bottom of the script — f_v412(prod_calm) vs f_v413(prod_calm) — is the intervention that the prediction-distribution dashboard cannot perform. By replaying yesterday's inputs through both yesterday's model and today's model, you isolate the model's contribution. If the two pushforwards differ, the model moved; if they are identical, every observed shift is data drift. Production drift detectors must own a "shadow replay" capability — pinning yesterday's input window and replaying it through the current production model — or they cannot answer the on-call's first question.

The dataclass-light style here is intentional. Window is the unit of measurement — a window of (features, label) pairs collected over a fixed time interval. Every drift test compares two windows. diagnose() returns a dict the alertmanager templates into the page text — the responder sees the flavour name in the alert subject line, not "drift detected" with no further routing. ks_pvalue is the two-sample Kolmogorov-Smirnov test, the cheap default for univariate continuous distributions; for high-dimensional features, the production version uses Maximum Mean Discrepancy (MMD) or per-feature KS with Bonferroni correction.
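
For a wide feature table, the per-feature KS variant with Bonferroni correction mentioned above might look like this sketch; the function name and the alpha default are illustrative assumptions:

# One KS test per column; a column counts as drifted if p < alpha / n_features.
import pandas as pd
from scipy import stats

def ks_drift_panel(ref: pd.DataFrame, win: pd.DataFrame, alpha: float = 0.001) -> pd.DataFrame:
    """Per-feature two-sample KS with a Bonferroni-corrected significance level."""
    n = ref.shape[1]
    rows = []
    for col in ref.columns:
        p = stats.ks_2samp(ref[col].dropna(), win[col].dropna()).pvalue
        rows.append({"feature": col, "ks_p": p, "drifted": p < alpha / n})
    return pd.DataFrame(rows).sort_values("ks_p")

# Against the windows from drift_detect.py:
#   ref = pd.DataFrame({"amount": train.amount, "age": train.account_age_days})
#   win = pd.DataFrame({"amount": prod_cov.amount, "age": prod_cov.account_age_days})
#   print(ks_drift_panel(ref, win))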

Detection: what to monitor and at what cadence

The detector script above is the kernel; production drift monitoring wraps it in three concentric loops with different cadences and different thresholds. Get the cadences wrong and you either flap on noise (too fast) or miss the regime change (too slow).

Inner loop — input drift, every 5 minutes. Monitor P(X) of each input feature against the training distribution using a rolling 5-minute window. A KS test per numeric feature, a chi-squared test per categorical feature. Threshold: p < 0.001 sustained for 3 consecutive windows = page. Why three windows: a single 5-minute window of weird traffic (a marketing campaign, a partner outage funnelling traffic through a different path) routinely produces a KS-significant shift that resolves on its own. Three consecutive windows means it is structural. Why such a tight p-threshold: with hundreds of features, even Bonferroni-corrected p < 0.05 produces dozens of false alarms per day; teams that started at p < 0.05 migrated to p < 0.001 within their first quarter to keep the page rate sane.
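
One way to wire the inner-loop mechanics: a chi-squared probe for categorical features, and a small state object that pages only when the alarm holds for three consecutive windows. The class and its defaults are illustrative, not a specific alerting API:

# Page only when the per-window drift flag holds for 3 consecutive 5-minute windows.
from collections import Counter, deque
from scipy import stats

class SustainedAlarm:
    def __init__(self, consecutive: int = 3):
        self.history = deque(maxlen=consecutive)
    def observe(self, drifted: bool) -> bool:
        """Return True (page) only when every window in the history drifted."""
        self.history.append(drifted)
        return len(self.history) == self.history.maxlen and all(self.history)

def chi2_pvalue(ref_values, win_values) -> float:
    """Chi-squared test on category counts, aligned over the union of categories."""
    cats = sorted(set(ref_values) | set(win_values))
    ref_c, win_c = Counter(ref_values), Counter(win_values)
    table = [[ref_c.get(c, 0) for c in cats], [win_c.get(c, 0) for c in cats]]
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return float(p)

alarm = SustainedAlarm(consecutive=3)
pages = [alarm.observe(p < 0.001) for p in (0.0004, 0.0007, 0.0002)]
print(pages)   # [False, False, True]: only the third consecutive breach pages
print(round(chi2_pvalue(["UPI"] * 900 + ["card"] * 100, ["UPI"] * 700 + ["card"] * 300), 6))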

Middle loop — output drift, every 15 minutes. Monitor f(P) — the prediction distribution — against a 7-day envelope. PSI (Population Stability Index) over 10 deciles is the standard metric: PSI < 0.1 is no drift, 0.1–0.25 is moderate, > 0.25 is significant. PSI is more robust than KS for the prediction-score distribution because it tolerates the discreteness of decision-threshold cliffs. Threshold: PSI > 0.25 = page. Output drift is downstream of input drift in time — by the time f(P) shifts, your scoring decisions have already changed for ~15 minutes, and the financial impact has already accrued.
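
A sketch of the PSI computation, with decile edges frozen from the reference scores; the smoothing constant is an assumption to keep empty bins from producing a log of zero:

# Population Stability Index over deciles frozen from the reference scores.
import numpy as np

def psi(ref_scores: np.ndarray, win_scores: np.ndarray, n_bins: int = 10) -> float:
    """PSI = sum over bins of (p_win - p_ref) * ln(p_win / p_ref)."""
    edges = np.quantile(ref_scores, np.linspace(0, 1, n_bins + 1))
    win = np.clip(win_scores, edges[0], edges[-1])    # out-of-range scores land in edge bins
    ref_p = np.histogram(ref_scores, bins=edges)[0] / len(ref_scores)
    win_p = np.histogram(win, bins=edges)[0] / len(win_scores)
    eps = 1e-6                                        # smoothing (assumption) for empty bins
    ref_p, win_p = ref_p + eps, win_p + eps
    return float(np.sum((win_p - ref_p) * np.log(win_p / ref_p)))

# Against the drift_detect.py windows, the middle-loop rule would page on PSI > 0.25:
#   scores_ref = model_v412(train.amount, train.account_age_days)
#   scores_win = model_v412(prod_cov.amount, prod_cov.account_age_days)
#   print(psi(scores_ref, scores_win))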

Outer loop — concept drift, daily. Monitor P(Y|X) using delayed labels. Most production fraud labels arrive 24–72 hours after the prediction (chargeback windows, manual review queues), so concept drift is detectable only on the day-after timescale. The standard test: bin predictions into deciles, compute the empirical fraud rate per decile, compare to the rate observed at training time. A model where decile 9 used to have 12% fraud but now has 4% has lost calibration in its highest-confidence region — concept drift in the most expensive place. Threshold: > 30% relative change in any decile's empirical rate over 7 rolling days = retrain ticket auto-filed.
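
A sketch of that decile-calibration probe: bin by the training-time score deciles, compute the empirical positive rate per decile in both windows, and report the deciles whose rate moved by more than the 30% relative threshold. Names and the NaN handling are illustrative:

# Outer-loop concept-drift probe on delayed labels, binned by training-score deciles.
import numpy as np

def decile_rates(scores: np.ndarray, labels: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Empirical positive rate per score decile (NaN for an empty decile)."""
    bins = np.digitize(scores, edges[1:-1])
    return np.array([labels[bins == b].mean() if np.any(bins == b) else np.nan
                     for b in range(len(edges) - 1)])

def concept_drift_deciles(train_scores, train_labels, win_scores, win_labels,
                          rel_change: float = 0.30) -> np.ndarray:
    edges = np.quantile(train_scores, np.linspace(0, 1, 11))   # deciles frozen at training time
    r_train = decile_rates(train_scores, train_labels, edges)
    r_win = decile_rates(win_scores, win_labels, edges)
    with np.errstate(invalid="ignore", divide="ignore"):
        rel = np.abs(r_win - r_train) / np.where(r_train > 0, r_train, np.nan)
    return np.flatnonzero(rel > rel_change)   # deciles breaching the 30% relative threshold

# Usage with drift_detect.py (labels arrive 24-72h after the prediction in production):
#   s_tr = model_v412(train.amount, train.account_age_days)
#   s_cc = model_v412(prod_conc.amount, prod_conc.account_age_days)
#   print(concept_drift_deciles(s_tr, train.label, s_cc, prod_conc.label))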

Figure: PSI on the prediction histogram — the middle-loop drift signal. Daily PSI over 14 days climbs from a 0.05 baseline to 0.18, then jumps to 0.31 and crosses the page threshold; the day-of-page prediction histogram is clearly left-shifted relative to the training shape. PSI = Σ over deciles of (p_now − p_train) · ln(p_now / p_train); PSI > 0.25 sustained 24h = page, 0.10–0.25 = warn, < 0.10 = healthy.
Illustrative — PSI rises slowly over 14 days, then crosses the 0.25 page threshold. The right panel shows the prediction histogram has shifted noticeably left compared to training. PSI integrates the shift across all deciles and is stable to noise that KS would flag.

Diagnosis: the four-step ladder

When the page fires, the on-call follows a deterministic ladder. The point of the ladder is to compress what is otherwise a 40-minute root-cause hunt into 4–6 minutes of mechanical checks. Karan's incident at 11:42 IST gets resolved on rung 2.

Rung 1 — is the lineage graph clean? Open the lineage-aware alarm panel (/wiki/lineage-aware-alerting). If any upstream feature contract has failed in the last 4 hours, the drift you are seeing is downstream of a known root cause; suppress your own page, link the incident, hand off to the producer's on-call. If clean, proceed.

Rung 2 — model-drift probe. Run the shadow-replay job: replay a fixed 1-hour input window from yesterday through both yesterday's model binary and today's. If the predictions differ at all, the model moved — even if the deploy log says it didn't. Open the deploy log, the feature-engineering repo, and the post-processing layer; look for any change in the last 24 hours. Most "silent model drifts" are a feature transformer that auto-updated its preprocessing parameters (a StandardScaler re-fit on a recent batch, a categorical encoder that learned new vocabulary), not the model binary itself.
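
A compact version of the rung-2 probe as a standalone check; the function takes a pinned input window and two model callables, so the loader that pins yesterday's window is left as an assumption:

# Rung-2 probe: replay one frozen input window through both model versions
# and compare the pushforwards. Any difference means the model (or its
# preprocessing) moved, independent of whatever the inputs are doing today.
import numpy as np

def model_drift_probe(pinned_inputs: dict, model_yesterday, model_today,
                      tol: float = 1e-6) -> dict:
    p_old = model_yesterday(**pinned_inputs)
    p_new = model_today(**pinned_inputs)
    max_abs = float(np.max(np.abs(p_new - p_old)))
    return {
        "max_abs_diff": max_abs,
        "mean_delta": float(np.mean(p_new) - np.mean(p_old)),
        "model_moved": max_abs > tol,
    }

# Against drift_detect.py, the v412/v413 pair on yesterday's calm window:
#   probe = model_drift_probe(
#       {"amount": prod_calm.amount, "account_age_days": prod_calm.account_age_days},
#       model_v412, model_v413)
#   probe["model_moved"] is True with zero input movement, so rung 2 fires.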

Rung 3 — flavour test. With model drift ruled out, run the per-flavour tests: KS on each input feature for covariate shift, label-rate comparison for prior shift, decile-calibration check for concept drift. The flavour determines the playbook: covariate shift → schedule retrain in next sprint, no immediate action; prior shift → page the threshold-tuning owner, expect a same-day fix; concept drift → page the model owner, expect an emergency retrain within 24h.

Rung 4 — adversarial check. If the flavour is concept drift in the highest-decision-confidence deciles, treat as adversarial-until-proven-otherwise. New fraud rings, prompt-injection vectors, jailbreak patterns in LLM systems — these all manifest as concept drift in the high-score region. Loop in the security team. Adversarial drift looks like normal concept drift on every chart; the only thing that distinguishes it is the velocity (hours, not weeks) and the geographic / temporal clustering.

Common confusions

  • "Data drift means the model is broken." It does not. Data drift means the world moved; the model is doing what it was trained to do, the inputs no longer match the regime it learned. The model may even still be correct on the new regime if the new regime overlaps with training. Data drift is a signal that something changed; whether anything is broken requires checking ground-truth labels (delayed) or running shadow evaluation against a holdout.
  • "PSI > 0.25 means retrain." It means the prediction distribution moved; the cause might be data drift (retrain plausibly helps) or model drift (retraining the new model on shifted inputs locks in the regression). Always run rung 2 (model-drift probe) before scheduling a retrain. Teams that "retrain on PSI alarms" sometimes retrain a silently-regressed model and never notice — the new training run inherits the regression and the dashboard now looks fine because the baseline drifted with the model.
  • "We can detect drift on the prediction histogram alone." You can detect that something moved; you cannot diagnose what moved without observing inputs and labels separately. The prediction histogram is the symptom dashboard; the diagnostic dashboards are the per-feature P(X) panels and the delayed-label P(Y|X) panels. Single-pane drift detectors trade diagnosis for simplicity, and pay for it in misrouted incidents.
  • "Concept drift is rare; covariate shift is what we usually see." Concept drift is rare in stable domains (commodity classification, well-understood physical processes); it is common in adversarial domains (fraud, abuse, recommendation, content moderation). In fraud specifically, concept drift is the rule, not the exception, because adversaries are deliberately moving P(Y|X). The "rare in industry" myth comes from textbook examples that assume non-adversarial environments.
  • "A 7-day rolling baseline is enough." It is enough for detection but not for diagnosis. A 7-day window will quietly absorb a slow drift — by day 8, the baseline has shifted with the production stream and the alarm never fires. Production drift detection uses two baselines: a fixed reference (the training distribution, frozen at deploy time) and a rolling reference (last 7 days). The fixed reference catches slow drift; the rolling reference catches abrupt regime changes. Both fire independently.
  • "LLM applications don't have data drift." They do — the user-prompt distribution shifts when a new use case adopts your endpoint, the upstream retrieval corpus drifts as new documents are indexed, the safety-classifier inputs drift as new jailbreaks circulate. Production LLM observability (/wiki/observability-for-data-and-ml-is-different) tracks prompt-embedding drift, retrieval-doc drift, and output-classifier drift on the same three-loop cadence. The vocabulary changes; the discipline does not.

Going deeper

Population Stability Index, formally

PSI is a divergence on binned distributions: PSI = Σ_i (p_i^{now} − p_i^{train}) · ln(p_i^{now} / p_i^{train}) where i indexes the deciles. It is a symmetrised KL divergence in disguise — the only practical difference is that the symmetrisation makes it stable to which side you treat as reference, which matters when you flip baselines during retraining. The conventional thresholds (0.1 / 0.25) come from credit-risk modelling at FICO in the 1990s and have proven robust enough that they ship as defaults in most ML observability platforms, but they are tunable. For high-stakes domains (fraud, lending), production teams use 0.05 / 0.15 and accept the higher page rate. For high-volume low-stakes domains (recommendation), 0.20 / 0.50 cuts the page rate without missing the regime changes that matter.
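
A quick numerical check of the symmetrised-KL claim on a pair of binned distributions; the bin probabilities are made up and strictly positive by construction:

# PSI equals KL(now||train) + KL(train||now) on the same bins:
# sum (a-b)*ln(a/b) = sum a*ln(a/b) + sum b*ln(b/a).
import numpy as np

p_train = np.array([0.12, 0.11, 0.10, 0.10, 0.10, 0.10, 0.09, 0.10, 0.09, 0.09])
p_now   = np.array([0.18, 0.14, 0.12, 0.10, 0.09, 0.09, 0.08, 0.08, 0.06, 0.06])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

psi_value = float(np.sum((p_now - p_train) * np.log(p_now / p_train)))
print(f"PSI                              = {psi_value:.6f}")
print(f"KL(now||train) + KL(train||now)  = {kl(p_now, p_train) + kl(p_train, p_now):.6f}")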

Embedding drift for LLM and vision models

Tabular drift detectors do not work on raw text or images — the input space is too high-dimensional for KS or chi-squared. Production LLM and vision systems monitor embedding drift: take a sample of recent inputs, run them through a fixed embedding model (CLIP for images, a small sentence-transformer for text), and monitor the centroid and covariance of the embedding distribution against the training-time embeddings. The standard test is Maximum Mean Discrepancy (MMD) with an RBF kernel, computed on a 1000-sample subset every 15 minutes. The thresholds are calibrated empirically per system — an MMD of 0.02 on a sentence-transformer embedding is roughly the threshold where downstream task performance starts degrading on most domains, but you must measure your own.
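
A sketch of an RBF-kernel MMD estimate on embedding samples; the median-distance bandwidth heuristic and the synthetic 64-dimensional embeddings are assumptions, and the decision threshold stays whatever you calibrated for your own system:

# Embedding-drift check: biased MMD^2 estimate between two (n, d) embedding samples.
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma=None) -> float:
    z = np.vstack([x, y])
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (z @ z.T), 0.0)   # pairwise squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2.0)   # median heuristic (assumption)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return float(kxx.mean() + kyy.mean() - 2.0 * kxy.mean())

rng = np.random.default_rng(1)
train_emb = rng.normal(0.0, 1.0, (1000, 64))    # stand-in for training-time prompt embeddings
prod_emb = rng.normal(0.15, 1.0, (1000, 64))    # production prompt distribution, slightly shifted
print(f"MMD^2 = {rbf_mmd2(train_emb, prod_emb):.4f}")   # compare against your calibrated threshold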

Counterfactual logging and the sample-selection trap

If your model rejects 5% of transactions, your next training batch is missing 5% of the world — specifically the 5% most likely to be fraudulent. Train on this batch and the new model unlearns what the old model knew. Production fix: log every prediction with its decision and the action taken, including rejected transactions, and at training time stratified-sample across all decision bands. The Hotstar recommender team at a hypothetical scale (8M concurrent watch sessions during an IPL final) handles this by logging every prediction with decision_bucket: serve | suppress | shadow, then training on serve plus a 10× upsampled suppress plus all shadow. Without that scaffolding, the recommender forgets which content it learned to suppress and re-learns it as good.
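
A sketch of the stratified draw described above; the decision_bucket values and the 10× upsample follow the hypothetical scheme in this paragraph, and the DataFrame columns are assumptions:

# Build a training batch that keeps the suppressed/rejected region visible.
import pandas as pd

def stratified_training_batch(log: pd.DataFrame, suppress_upsample: int = 10,
                              seed: int = 0) -> pd.DataFrame:
    """log has one row per prediction, with a 'decision_bucket' column."""
    serve = log[log.decision_bucket == "serve"]
    suppress = log[log.decision_bucket == "suppress"]
    shadow = log[log.decision_bucket == "shadow"]
    suppress_up = suppress.sample(frac=suppress_upsample, replace=True, random_state=seed)
    return pd.concat([serve, suppress_up, shadow]).sample(frac=1, random_state=seed)

# Usage: every prediction is logged, including the ones the model suppressed;
# without the suppress rows the next model quietly unlearns the rejected region.
#   batch = stratified_training_batch(prediction_log)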

Why retraining on drifted data sometimes makes drift worse

A naive retraining loop refits the model on the last N days of production data and redeploys. If the production data is drifted relative to the true distribution (because of feedback loops, sample-selection bias, or label leakage), the retrained model is now more fitted to the drifted regime than the old one — and when conditions return to normal, the new model is wrong. Production teams use holdout retraining: hold out a stable validation set drawn from the original training distribution, retrain on recent data, and reject the candidate model if validation loss on the held-out set has degraded by more than a threshold. The candidate must improve on recent data without regressing on the original distribution.
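
A sketch of that holdout gate; the log-loss metric, the 2% regression tolerance, and the candidate/incumbent callables are illustrative assumptions:

# Accept a retrained candidate only if it improves on recent data without
# regressing on the frozen original-distribution holdout.
import numpy as np

def log_loss(y: np.ndarray, p: np.ndarray, eps: float = 1e-9) -> float:
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accept_candidate(candidate, incumbent, holdout, recent, max_regression: float = 0.02) -> bool:
    """holdout/recent are (feature_dict, labels) pairs; max_regression is a relative tolerance."""
    (hx, hy), (rx, ry) = holdout, recent
    holdout_ok = log_loss(hy, candidate(**hx)) <= log_loss(hy, incumbent(**hx)) * (1 + max_regression)
    recent_better = log_loss(ry, candidate(**rx)) < log_loss(ry, incumbent(**rx))
    return holdout_ok and recent_better

# With drift_detect.py's pieces: candidate=model_v413, incumbent=model_v412,
#   holdout=({"amount": train.amount, "account_age_days": train.account_age_days}, train.label),
#   recent=({"amount": prod_conc.amount, "account_age_days": prod_conc.account_age_days}, prod_conc.label)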

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install numpy scipy pandas
python3 drift_detect.py
# Expected output: a 4-row dataframe diagnosing each window's drift flavour
# (no_data_drift, covariate_shift, prior_probability_shift,
#  concept_drift_or_combined), then a model-drift probe showing that
# replaying the same inputs through v412 vs v413 produces a +0.0285 mean
# prediction shift — pure model drift with zero data movement.
# Tweak the v413 coefficient to match v412 and rerun: model-drift delta
# collapses to zero, confirming the detector is sensitive to the model
# variable in isolation.

Where this leads next

Drift detection is the trigger; the next chapter /wiki/shadow-evaluation-and-canary-models is the response — running a candidate model in shadow against the same input stream so you can quantify the prediction delta before promoting it. The shadow-evaluation pipeline depends directly on the model-drift probe in this chapter: every shadow model is a controlled "new f, same P" experiment, and the same machinery that catches accidental model drift catches deliberate model swaps that introduce regressions.

Beyond Build 15, drift signals feed back into the broader observability surface. The lineage graph from /wiki/lineage-aware-alerting gets edges weighted by drift correlation — features that drift together belong to the same upstream change. The data-quality SLOs from /wiki/data-quality-metrics-as-slos gain a "drift" clause alongside freshness, completeness, and distribution. The alerting discipline from /wiki/alert-fatigue-as-a-production-failure absorbs drift alarms into the multi-window burn-rate frame: drift incidents have the same shape (slow-burn signal that crosses a threshold) and the same failure mode (page storms when the threshold is too tight).

By the end of Build 15, Karan's 11:42 IST page becomes a 4-minute mechanical diagnosis: rung 1 clean (no upstream contract failure), rung 2 finds a feature-transformer auto-update from this morning, rung 3 not needed. The model has not drifted — its preprocessor has — and the fix is a one-line revert. The page goes to the right team, the right code, and Karan is back in bed by 11:52.
