Wall: nothing works without time
It is 03:11 and Karan is staring at three screens. The leader of PaySetu's settlement-ledger Raft group has just been declared dead by two of its four followers and alive by the other two. The split is not random — it is exactly along the boundary of which followers run on hosts whose NTP daemon last synchronised before the data centre's GPS antenna started reporting LEAP_NOT_IN_SYNC at 02:53. The followers whose clocks drifted forward by 14 ms decided their election timeout had expired; the followers whose clocks drifted backward by 9 ms still saw the leader as healthy. The leader itself is fine — its process is alive, its disk is fsync'ing, its replication thread is sending AppendEntries every 50 ms exactly as designed. The cluster has fractured not because the network failed, not because a node crashed, not because anyone deployed bad code, but because four machines stopped agreeing on what time it was. Karan opens the Raft protocol paper and re-reads the election-timeout section. Every sentence in it presupposes a clock. Every sentence in it lies.
Part 2 of this curriculum has built a complete failure model — partial failures, fail-stop / fail-slow / fail-silent, asymmetric partitions, gray failures. Every one of those models assumes a thing it never names: a working clock. Failure detection is "no message in T seconds"; T is a clock measurement. Retries are "wait, then resend"; the wait is a clock measurement. Leader leases are "valid until time X"; X is a clock. The wall this chapter names is that the failure-models toolkit is structurally incomplete until time itself becomes a first-class problem — which is what Part 3 is about.
Every failure-detection primitive is a clock with a threshold
Open any failure-detection literature and the algorithms read as decisions made about messages. "If a heartbeat does not arrive, mark the peer suspect." "If three consecutive heartbeats are missed, mark the peer failed." "If the response time exceeds the deadline, abort the call." Each of these sentences contains a hidden clause: as measured by which clock?
A heartbeat is sent at t_send on the sender's clock. It arrives at t_recv on the receiver's clock. The receiver's failure detector watches the gap t_recv_now - t_recv_last and trips when that gap exceeds a threshold T. Three things can make this trip incorrectly without any actual failure: (1) the network delivers the packet but the receiver's clock has jumped forward (NTP step, virtual machine pause, leap-second handling), so t_recv_now is artificially large; (2) the sender's clock has jumped backward, so the sender's heartbeat scheduling falls behind real time and the receiver sees a real gap that the sender did not intend; (3) the network really delayed the packet, but by less than T, while the receiver's clock simultaneously jumped forward, producing a combined gap that crosses the threshold. The detector cannot distinguish among these three causes because it only sees the gap, not the cause.
The T threshold is itself a clock measurement. When you tune Raft's election timeout to "150–300 ms", you are saying "150 ms as observed by the candidate's monotonic clock". If the candidate's monotonic clock drifts by 100 ppm (typical commodity quartz), 150 ms of intended timeout becomes 149.985 ms or 150.015 ms — irrelevant in isolation. Even during a CPU temperature swing or a virtualised host's clock-source change, when drift can briefly hit 1000 ppm, the error is still only ±0.15 ms per 150 ms interval. The magnitude is not the problem; the bias is. Drift is systematic, not random: the follower whose oscillator runs fast reaches its effective timeout before the follower whose oscillator runs slow, every single time the cluster is under stress — which quietly undercuts the randomised-timeout argument that no particular follower is consistently first to start an election.
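To keep the magnitudes honest, here is a back-of-the-envelope sketch (plain Python; the ppm figures are the ones quoted above):
# How much a ppm-scale drift distorts a measured interval
def effective_interval_ms(nominal_ms: float, ppm: float) -> tuple[float, float]:
    error = nominal_ms * ppm / 1_000_000
    return nominal_ms - error, nominal_ms + error

print(effective_interval_ms(150, 100))    # (149.985, 150.015) — commodity quartz
print(effective_interval_ms(150, 1000))   # (149.85, 150.15)   — transient worst case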
The deeper observation: the failure-models language of "the node failed at time T" is meaningless without specifying whose clock measured T. Two nodes that disagree about whether a third node failed are not necessarily disagreeing about the third node — they may be disagreeing about time, and the third node is the prop on which their clock-disagreement projects. Part 2 has been silent about this because Part 2 is about failure modes; the moment you try to implement a detector for those modes, time becomes the load-bearing wall.
Why monotonic vs wall-clock matters here specifically: a wall clock can step backward when NTP corrects drift — the kernel can suddenly report 09:00:00.000 after just reporting 09:00:00.014, because NTP decided the local clock was 14 ms ahead of the canonical one and stepped it back (slewing, the gradual rate adjustment, never moves the clock backwards; step corrections do). If your election-timeout code uses wall-clock subtraction, it can suddenly compute a negative elapsed time, which the conditional then treats as "no time has passed" — the timer never trips, the failed leader stays "alive" forever from that follower's view. Monotonic clocks (Linux CLOCK_MONOTONIC, Go's time.Since, Python's time.monotonic) do not step backward, which is why every protocol implementation that has been debugged in production uses them for intervals and reserves wall-clock for timestamps that need to be compared across machines. The cost is that you cannot ask "is the lease still valid?" using monotonic time when the lease was issued by a different machine with its own monotonic origin — that is the gap Part 3 spends three chapters bridging.
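A minimal sketch of the difference, standard library only; the 14 ms backward step is simulated in-process rather than produced by a real NTP correction:
import time

def elapsed_wall(start_wall: float, now_wall: float) -> float:
    # Wall-clock subtraction: goes negative if the clock is stepped backward,
    # and a naive "elapsed > timeout" check then never fires.
    return now_wall - start_wall

def elapsed_mono(start_mono: float) -> float:
    # Monotonic subtraction: never negative, safe for timeouts on one machine.
    return time.monotonic() - start_mono

start_wall, start_mono = time.time(), time.monotonic()
stepped_back = start_wall - 0.014                 # simulate a 14 ms backward step
print(elapsed_wall(start_wall, stepped_back))     # -0.014: the timer "un-elapses"
print(elapsed_mono(start_mono))                   # tiny positive value, only ever grows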
Demonstrating the failure: a short Python script that breaks under clock skew
The cleanest way to see the wall is to write the simplest possible failure detector, drive it with heartbeats from a simulated peer, then introduce clock skew and watch it produce wrong verdicts. The script below stands up a heartbeat-style detector watching a single peer, drives it with simulated heartbeats over a healthy network, then injects a 200 ms forward jump in the detector node's clock and shows the detector falsely declaring the peer dead.
# wall_clock_breaks_detector.py — a heartbeat detector that fails under clock skew
import time
from dataclasses import dataclass, field

@dataclass
class Detector:
    """Heartbeat-based detector: peer is suspect if no heartbeat in t_suspect seconds.
    Critically, this uses time.time() (wall-clock) rather than time.monotonic()."""
    name: str
    peer: str
    t_suspect: float = 0.150            # 150 ms suspect threshold
    last_heartbeat_at: float = 0.0
    clock_offset: float = 0.0           # injected skew in seconds
    log: list = field(default_factory=list)

    def now(self):
        # The bug: real systems often forget that wall-clock is not monotonic
        return time.time() + self.clock_offset

    def receive_heartbeat(self):
        self.last_heartbeat_at = self.now()

    def tick(self):
        gap = self.now() - self.last_heartbeat_at
        verdict = "SUSPECT" if gap > self.t_suspect else "ALIVE"
        self.log.append((round(gap * 1000, 1), verdict))
        return verdict

def simulate(skew_ms: float, ticks: int = 30, hb_period_ms: int = 50):
    d = Detector("n0", "n1")
    d.last_heartbeat_at = d.now()
    skew_injected = False
    for i in range(ticks):
        # Heartbeat arrives every hb_period_ms (no network loss in this run)
        if i % (hb_period_ms // 10) == 0:
            d.receive_heartbeat()
        # Inject forward clock jump halfway through the run
        if i == ticks // 2 and not skew_injected and skew_ms:
            d.clock_offset += skew_ms / 1000.0
            skew_injected = True
            d.log.append(("CLOCK_JUMP", f"+{skew_ms} ms"))
        d.tick()
        time.sleep(0.010)               # 10 ms wall ticks
    return d.log

if __name__ == "__main__":
    print("=== healthy run, no clock skew ===")
    for entry in simulate(skew_ms=0)[-8:]:
        print(" ", entry)

    print("\n=== same run, +200 ms forward clock jump injected mid-stream ===")
    log = simulate(skew_ms=200)
    jump = next(i for i, e in enumerate(log) if e[0] == "CLOCK_JUMP")
    for entry in log[jump:jump + 8]:    # show the window around the injected jump
        print(" ", entry)
Sample run:
=== healthy run, no clock skew ===
(20.4, 'ALIVE')
(30.6, 'ALIVE')
(40.7, 'ALIVE')
(0.1, 'ALIVE')
(10.4, 'ALIVE')
(20.8, 'ALIVE')
(30.6, 'ALIVE')
(40.7, 'ALIVE')
=== same run, +200 ms forward clock jump injected mid-stream ===
('CLOCK_JUMP', '+200 ms')
(200.1, 'SUSPECT')
(210.6, 'SUSPECT')
(220.4, 'SUSPECT')
(230.5, 'SUSPECT')
(240.7, 'SUSPECT')
(0.1, 'ALIVE')
(10.4, 'ALIVE')
Read the second block. The peer never failed. The peer is sending heartbeats every 50 ms exactly as before; the network is delivering them; the receive_heartbeat() method runs on schedule. The single thing that changed is that this node's wall-clock jumped forward by 200 ms — the kind of step a misconfigured NTP daemon, a virtualised host's clock-source switch, or a leap-second handler can produce in a single syscall. The detector's now() jumps with it, the gap computation sees 200 ms instead of <10 ms, and the detector's verdict flips to SUSPECT for five consecutive ticks, until the next heartbeat resets the gap. From the perspective of the cluster's membership protocol, this node has just announced that the leader is suspect. From the perspective of the leader, nothing has happened. The verdict is a property of this node's clock, not of the leader's behaviour.
Why a 200 ms skew is realistic, not a strawman: ntpd's default step threshold (tinker step 0.128) means offsets smaller than 128 ms are corrected by gradual slewing, but an offset larger than that — after a daemon restart, a long holdover, or an upstream change — is corrected by stepping the clock outright, so a single jump bigger than a protocol's 150 ms threshold is entirely possible. More dramatically, Linux KVM guests can experience sudden clock jumps when the hypervisor's clock source changes (e.g. a switch from TSC to HPET to kvmclock during host migration), and jumps of 100–500 ms have appeared in field reports. Cassandra, Riak, and Elasticsearch operators have all written incident post-mortems describing clusters that fragmented during exactly such an event — the network was fine, the nodes were fine, the clocks disagreed and the cluster fragmented along the disagreement.
Why fixing this is harder than "use monotonic clocks": monotonic clocks solve the interval problem (how much time has passed on this machine), but heartbeats are sent from another machine and need to be ordered relative to each other across machines. The receiving side needs to ask "did the heartbeat for round 47 arrive before the heartbeat for round 48?", and that comparison crosses machine boundaries. If the sender stamps each heartbeat with its own monotonic clock, the receiver cannot interpret the value because monotonic clocks have arbitrary origins. If the sender uses wall-clock, you are back in the broken regime. The bridge is logical clocks (Lamport timestamps, vector clocks, hybrid logical clocks) — which is precisely the agenda for Part 3.
Time appears in places you do not expect
The detector example is the obvious case. The wall actually extends much further — almost every distributed-systems mechanism reaches for time somewhere, and most of those reaches are silent in the protocol description. Three categories of hidden time-dependence are worth pulling out before Part 3 starts.
Causal ordering pretends to be timeless but is not. "Event A happened before event B" sounds like a logical predicate, independent of clocks. It is not — every implementation of "happened-before" needs a clock to mark events with. Lamport timestamps use a logical clock, but that logical clock is incremented on every send/receive, and the implementation runs on a real machine whose scheduling, interrupt delivery, and message transmission are all driven by physical oscillators. The abstraction removes wall-clock from the protocol's logic, but it does not remove it from the physical substrate; it merely localises the dependence. This matters when debugging: a Lamport-timestamped log that shows event A with timestamp 7 and event B with timestamp 8 does not, by itself, tell you that A happened before B in real time — the implication runs only one way: if A had happened before B, A's timestamp would be smaller, but a smaller timestamp alone proves nothing about real-time order.
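For readers who want the mechanics in front of them before Part 3 gets there, a minimal sketch of the standard Lamport rules — tick on every local event, jump past the sender's timestamp on receive; not a full implementation:
class LamportClock:
    def __init__(self) -> None:
        self.t = 0

    def local_event(self) -> int:
        self.t += 1                                # tick on every local event (including sends)
        return self.t

    def receive(self, msg_timestamp: int) -> int:
        self.t = max(self.t, msg_timestamp) + 1    # jump past the sender, then tick
        return self.t

a, b = LamportClock(), LamportClock()
m = a.local_event()    # a sends a message stamped 1
b.receive(m)           # b's clock becomes 2 — larger than a's, but only because a message
                       # connects them; unrelated events with timestamps 7 and 8 imply no order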
Idempotency keys have a TTL, and the TTL is wall-clock. Idempotency-based de-duplication — the standard answer to "what if a retry resends a payment?" — works by storing the idempotency key for a duration and checking it on each request. That duration is a wall-clock TTL: expires_at = now() + 24h. If two replicas of the de-dup store have skewed clocks by 30 minutes, a key that one replica has expired is still live on the other; a retry routed to the wrong replica may get processed twice. Real implementations (DynamoDB's idempotency table, Stripe's idempotency keys) sidestep this by centralising the de-dup store rather than replicating it across regions, but the moment you replicate de-dup state you re-introduce clock dependence.
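A sketch of that failure mode under stated assumptions — replicas share the stored expiry but check it against their own wall-clocks; the key name and the 30-minute skew are illustrative:
import time

TTL_S = 24 * 3600   # 24-hour idempotency-key TTL

def record(store: dict, key: str, writer_now: float) -> None:
    # The writer stamps an absolute wall-clock expiry; replicas copy it verbatim.
    store[key] = writer_now + TTL_S

def is_duplicate(store: dict, key: str, replica_now: float) -> bool:
    # Each replica checks the stored expiry against its OWN clock reading.
    expires_at = store.get(key)
    return expires_at is not None and replica_now < expires_at

store = {}
record(store, "payment-123", writer_now=time.time())

t = time.time() + TTL_S - 10 * 60                        # 10 minutes before nominal expiry
print(is_duplicate(store, "payment-123", t))             # True on a replica with an accurate clock
print(is_duplicate(store, "payment-123", t + 30 * 60))   # False on a replica running 30 minutes fast —
                                                         # a retry routed there is processed again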
Garbage collection of distributed state assumes time bounds. Tombstones in Cassandra, deletion markers in S3, vector-clock pruning in Riak — every distributed-storage system that supports deletion has to eventually remove the deletion record itself, otherwise tombstones accumulate forever. The standard mechanism is "after GC grace period of T (default 10 days), tombstones can be physically deleted". T is wall-clock. If a partition lasts longer than T and a node returns from the partition with state older than T ago, the system can resurrect deleted data — Cassandra's "zombie data" failure mode is exactly this. The protocol's correctness depends on the partition duration being shorter than T, which is a wall-clock claim that no failure model in Part 2 has the language to express.
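The correctness condition is narrow enough to state as a predicate; a sketch, with gc_grace_s standing in for Cassandra's gc_grace_seconds and its 10-day default:
def resurrection_possible(partition_duration_s: float, gc_grace_s: float = 10 * 24 * 3600) -> bool:
    # If a node stays partitioned longer than the grace period, the rest of the cluster may have
    # purged the tombstone; when the node rejoins, its stale pre-delete copy of the row has no
    # deletion marker left to lose to, and the delete is silently undone.
    return partition_duration_s > gc_grace_s

print(resurrection_possible(partition_duration_s=11 * 24 * 3600))   # True — zombie data is possible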
The pattern across all three: wherever a protocol claims to "eventually" do something, "eventually" is a wall-clock measurement in disguise. Eventual consistency promises convergence "eventually"; the only operationally useful reading of that promise is "within some finite time", and that time is a wall-clock interval; the wall-clock is the thing Part 2 has been silent about. Part 3 makes the silence visible.
A production incident — CricStream's leader-lease miscount, July 2024
CricStream runs its session-routing layer on a 5-node Raft cluster behind a Layer-7 load balancer. Each leader holds a 5-second lease over the routing-table partition; the lease lets the leader serve reads without a quorum round-trip. The lease is "valid until t = leader_clock_at_grant + 5000 ms"; t is computed on the leader's clock at grant time, but each follower checks it against its own local clock. Followers refuse to elect a new leader until their own clock reads past t, on the assumption that even with skew, the lease is over by the time every follower's clock has crossed t.
On 12 July 2024 at 19:15 IST — eight minutes before the IPL Eliminator final's first ball — a host-level kernel upgrade on three of the five Raft nodes finished its scheduled rolling reboot. The upgrade had inadvertently swapped the systemd-timesyncd daemon for chronyd; chronyd, on first start, queried the upstream pool and applied a step correction of +180 ms to the host clock (the previous timesyncd had been slowly slewing for 36 hours and the offset had grown). On the three rebooted nodes, the wall-clock jumped forward by 180 ms in a single syscall. On the two non-rebooted nodes, the clock did not jump.
The leader at the time was one of the non-rebooted nodes. It granted a 5-second lease at t_grant_leader = 19:23:11.000 (its own clock). The three rebooted followers received the lease grant and stored expiry = 19:23:16.000 (their interpretation). At wall-time 19:23:13.820 — just under three seconds into the lease — the rebooted followers' clocks read 19:23:14.000 (180 ms ahead). At wall-time 19:23:15.820 they read 19:23:16.000. The followers' code checked "is now() >= expiry?", got true 180 ms early, and a follower triggered an election thinking the lease had expired. The leader, whose clock had not jumped, was still happily serving reads under what it thought was a valid lease. For 180 ms the cluster had two leaders.
During those 180 ms, the load balancer routed two writes for two different user sessions to the two leaders simultaneously, both leaders accepted them, both leaders attempted replication, and the replication conflict resolved when the new leader (with the higher term) ultimately won — discarding 14 in-flight writes from the old leader. The 14 writes corresponded to 14 viewer-session updates: 12 of them were idle session pings and were lost without impact; 2 were assignments of premium-quality stream URLs to two viewers in Hyderabad whose sessions ended up routed to a stale URL. Both viewers saw a 4-second buffering pause when the URL became invalid 7 seconds later.
The post-incident root cause was filed as "rolling kernel upgrade triggered NTP-daemon swap with step correction; lease-expiry calculation does not account for cross-node clock-step skew". The fix was twofold: (1) extend lease grants by an additional max_expected_skew_ms (set to 250 ms for this fleet) so a follower considers a lease expired only after expiry + max_skew_ms of its own clock, and (2) move the comparison from wall-clock to a hybrid-logical-clock value where the leader's grant carries the leader's HLC and the follower advances its HLC monotonically on receipt. The hybrid-logical-clock fix is exactly what Part 3, chapter 17 ("Hybrid logical clocks (HLC)") is about.
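A sketch of fix (1) under the incident's numbers; the names and the milliseconds-within-the-minute representation are illustrative, not CricStream's actual code:
MAX_EXPECTED_SKEW_MS = 250.0   # fleet-wide bound chosen in the post-mortem

def lease_expired_naive(follower_now_ms: float, expiry_ms: float) -> bool:
    # Pre-incident check: trusts the follower's wall-clock outright.
    return follower_now_ms >= expiry_ms

def lease_expired_with_margin(follower_now_ms: float, expiry_ms: float) -> bool:
    # Post-incident check: only declare the lease dead once the follower's clock is past
    # expiry by more than the worst skew the fleet is believed to exhibit.
    return follower_now_ms >= expiry_ms + MAX_EXPECTED_SKEW_MS

# The 12 July timeline: follower clocks ran 180 ms fast.
expiry = 16_000.0                    # leader-granted expiry, 19:23:16.000 as ms within the minute
follower_now = 15_820.0 + 180.0      # true time 19:23:15.820, follower clock reads 19:23:16.000
print(lease_expired_naive(follower_now, expiry))         # True  -> premature election, split brain
print(lease_expired_with_margin(follower_now, expiry))   # False -> waits out the skew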
This incident belongs in every distributed-systems team's catalogue of cautionary stories because it has the structural property that no individual component failed. The kernel upgrade was applied correctly. The NTP daemons did their job correctly. Raft's election logic was correct. The lease arithmetic was correct given the wall-clock as input. The system fragmented because the wall-clock is not a reliable input — the wall this chapter names.
Common confusions
- "Monotonic clocks fix this; just use time.monotonic() everywhere." Monotonic clocks fix the interval problem on a single machine — they prevent now() - then() from going negative. They do not fix the cross-machine problem because monotonic clocks have arbitrary origins; comparing one machine's monotonic value to another's is meaningless. The lease problem requires either a coordinated clock (TrueTime) or a logical clock (HLC), not a stronger local clock.
- "NTP synchronisation makes wall-clocks reliable enough." NTP achieves typical accuracy of 1–10 ms on a healthy LAN and 10–100 ms across continents, but step corrections (sudden jumps) can be tens to hundreds of milliseconds when a daemon restarts, a clock source changes, or upstream pool servers drift. NTP's bound on accuracy is a probabilistic claim, not a guarantee — and protocols that bake in a 5 ms skew assumption (Spanner's TrueTime is 7 ms p99) require dedicated atomic-clock and GPS-receiver infrastructure to make the claim hold. Default NTP is not that.
- "Logical clocks (Lamport, vector) make the wall-clock issue go away." Logical clocks make event ordering go away as a wall-clock problem; they do not address timeouts, leases, or any "wait for T seconds" mechanism, all of which still need a physical-time measurement. A working distributed system uses logical clocks for ordering and physical clocks for waiting, and the two systems coexist — they do not substitute for each other.
- "Just make timeouts longer; the skew goes into the slack." Longer timeouts trade tail latency for false-positive avoidance. A 30-second heartbeat timeout absorbs 200 ms of skew comfortably, but it also means a real failure takes 30 seconds to detect — during which the cluster's availability is degraded for any workload that depended on the failed node. The "make timeouts longer" answer is correct but partial; it pushes the trade-off, it does not eliminate it. Part 10 (failure detection) and Part 9 (leases) are about the principled versions of this trade-off.
- "Spanner's TrueTime solves time once and for all." TrueTime is a bounded-uncertainty clock — tt.now() returns an interval [earliest, latest], not a point. Spanner's commit-wait protocol then waits out the uncertainty by sleeping (latest - earliest) before acknowledging a write. This works only because Google operates a fleet of GPS receivers and atomic clocks across data centres to keep the interval to 7 ms p99. Outside Google's infrastructure (and a few cloud-provider clones), most distributed systems cannot afford the hardware investment, which is why HLC and Lamport variants dominate — they are the practical answer when you cannot have TrueTime.
- "This chapter says clocks are unreliable; therefore consensus is impossible." Clocks are unreliable for some questions ("what is the absolute current time on the leader's wall-clock?") and reliable for other questions ("has at least one second passed since the last event on this CPU?"). Distributed protocols are designed around the reliable questions and against the unreliable ones — Raft's election timeout uses intervals (reliable on monotonic clocks), Spanner's commit-wait uses bounded uncertainty (reliable when measured), Lamport timestamps use advance-on-event (reliable by construction). Consensus is achievable; what is not achievable is consensus that depends on unbounded-precision wall-clock agreement, which is why no real protocol asks for that.
Going deeper
Why Part 2 deliberately avoided time until now
A pedagogical choice runs through this curriculum: Part 1 introduces the forces that make distributed systems necessary (scale, physics, availability), Part 2 introduces the failure surface (partial failures, network failure modes, slow nodes), and Part 3 introduces time and order. The order is not arbitrary. Part 2's failure-mode taxonomy is structurally complete on its own — it identifies what can break — and explicitly refuses to address how you detect what broke. That refusal is deliberate, because every detector is a clock, and clocks are a coherent topic that deserves its own part rather than being smeared across failure-detection examples. By the end of Part 2 you should have an uncomfortable feeling: I now know all the ways the system can fail, but I do not know how I would tell. That feeling is the wall. Part 3 is the demolition.
The Chubby paper's line about clocks
Mike Burrows's 2006 OSDI paper on Chubby — Google's distributed lock service — has a single sentence in §2.4 that frames the time problem more directly than any of the textbook treatments: "Clients must extend their leases before they expire, but the leader uses local time to decide when leases have expired." The sentence is buried in a paragraph about lease management, but it carries the entire weight of this chapter. The leader's notion of "the lease has expired" is determined by its clock; the client's notion of "I should renew now" is determined by its clock. If the two clocks disagree, the renewal can arrive after the leader has already considered the lease expired — and at that moment, the leader believes the client is dead, while the client believes itself alive. Chubby's specific defence is to add a grace period to lease grants and to refuse to grant a lease less than that grace period long, but the deeper point is that even a service whose primary product is "agreement on time-bounded leases" has to build the agreement against unreliable clocks. The protocol is not made-up safety theatre; it is the load-bearing structure that lets a useful service exist on top of clocks that lie.
What FLP impossibility says about time, exactly
The Fischer-Lynch-Paterson impossibility result (1985) says: in an asynchronous network where one process can crash, no deterministic consensus protocol can guarantee termination. The "asynchronous" qualifier is the key — FLP's network has no timing guarantees at all; messages can be delayed by an arbitrarily long but finite amount. In this model, you cannot tell apart "the message is in flight, slow" from "the sender crashed". Real consensus protocols (Paxos, Raft) circumvent FLP not by violating its proof — the proof is correct — but by assuming a partial-synchrony model: messages are delivered within some bound Δ eventually, even if Δ is unknown initially. Under partial synchrony, randomised timeouts plus a high-enough T recover liveness; the cluster may take longer to elect a leader during a temporary asynchrony spike, but it does eventually elect one. The price of liveness is exactly this clock-based detector — and the FLP result is the formal statement that without it, you do not get consensus at all. Part 8 (consensus) returns to this in detail; for now, the takeaway is that the wall this chapter names is the same wall that the deepest impossibility result in the field names. It is not a curriculum-organisation choice; it is the physical structure of the problem.
Reproduce this on your laptop
# Reproduce the wall-clock detector breakage
python3 -m venv .venv && source .venv/bin/activate
python3 wall_clock_breaks_detector.py
# Edit the skew_ms value to find your detector's tipping point
# Try 50, 100, 150, 200; observe at what skew the verdict flips
# See your own machine's clock-step rate via NTP queries
chronyc tracking # if you run chrony
ntpq -p # if you run ntpd
# Both report the current offset from upstream time sources (chronyc tracking also shows the last offset applied)
# Demonstrate the monotonic-vs-wall difference directly
python3 -c "import time; w0=time.time(); m0=time.monotonic(); time.sleep(1); print(f'wall delta: {time.time()-w0:.6f}, mono delta: {time.monotonic()-m0:.6f}')"
Where this leads next
Part 3 picks up the gauntlet this chapter throws down. Each chapter in Part 3 is a specific weapon for a specific clock failure mode:
- Wall clocks and NTP — what the OS-level wall-clock actually is, what NTP does and does not guarantee, when steps happen.
- Monotonic clocks — the answer to interval measurement; why every well-debugged protocol uses these for timeouts.
- Logical clocks (Lamport) — the canonical answer to "happened-before" without depending on physical time.
- Vector clocks — the extension that captures concurrency, not just order.
- Hybrid logical clocks (HLC) — the practical synthesis: logical clocks that also track wall-clock loosely, used by CockroachDB, MongoDB, YugabyteDB.
- TrueTime and Spanner's bounded uncertainty model — the only deployed system that gives you globally-consistent wall-clock with proven bounds, and what it costs.
The short version of Part 3's promise: by the end of it, you can write a failure detector that does not lie when one node's clock jumps, a lease that does not split-brain when followers and leader disagree about wall-time, and a causal-ordering scheme that survives a six-hour partition. None of this is possible inside Part 2's vocabulary, which is why Part 2 ends here. The wall is real; Part 3 is what gets you over it.
References
- Time, Clocks, and the Ordering of Events in a Distributed System — Lamport, CACM 1978. The foundational paper that makes "happened-before" precise without wall-clock dependence.
- The Chubby Lock Service for Loosely-Coupled Distributed Systems — Burrows, OSDI 2006. The lease-management paper whose §2.4 quote on clocks captures the wall this chapter names.
- Spanner: Google's Globally-Distributed Database — Corbett et al., OSDI 2012. The TrueTime architecture and its bounded-uncertainty interval.
- Impossibility of Distributed Consensus with One Faulty Process — Fischer, Lynch, Paterson, JACM 1985. The asynchrony-vs-consensus result that frames why timeouts exist at all.
- Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases — Kulkarni et al., 2014. The hybrid-logical-clock formalisation used by CockroachDB and YugabyteDB.
- How real networks actually fail (studies) — internal cross-link to the Part 2 chapter that catalogues the network-side failures this chapter pairs with on the time-side.
- Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8 ("The Trouble with Distributed Systems") synthesises clocks, networks, and partial failures for a practitioner audience.
- Hybrid Logical Clocks in CockroachDB — Cockroach Labs engineering blog. The deployed-system view of why HLC is the practical answer when TrueTime is not available.