Wall: nothing works without time

It is 03:11 and Karan is staring at three screens. The leader of PaySetu's settlement-ledger Raft group has just been declared dead by two of its four followers and alive by the other two. The split is not random — it is exactly along the boundary of which followers run on hosts whose NTP daemon last synchronised before the data centre's GPS antenna started reporting LEAP_NOT_IN_SYNC at 02:53. The followers whose clocks drifted forward by 14 ms decided their election timeout had expired; the followers whose clocks drifted backward by 9 ms still saw the leader as healthy. The leader itself is fine — its process is alive, its disk is fsync'ing, its replication thread is sending AppendEntries every 50 ms exactly as designed. The cluster has fractured not because the network failed, not because a node crashed, not because anyone deployed bad code, but because four machines stopped agreeing on what time it was. Karan opens the Raft protocol paper and re-reads the election-timeout section. Every sentence in it presupposes a clock. Every sentence in it lies.

Part 2 of this curriculum has built a complete failure model — partial failures, fail-stop / fail-slow / fail-silent, asymmetric partitions, gray failures. Every one of those models assumes a thing it never names: a working clock. Failure detection is "no message in T seconds"; T is a clock measurement. Retries are "wait, then resend"; the wait is a clock measurement. Leader leases are "valid until time X"; X is a clock. The wall this chapter names is that the failure-models toolkit is structurally incomplete until time itself becomes a first-class problem — which is what Part 3 is about.

Every failure-detection primitive is a clock with a threshold

Open any failure-detection literature and the algorithms read as decisions made about messages. "If a heartbeat does not arrive, mark the peer suspect." "If three consecutive heartbeats are missed, mark the peer failed." "If the response time exceeds the deadline, abort the call." Each of these sentences contains a hidden clause: as measured by which clock?

A heartbeat is sent at t_send on the sender's clock. It arrives at t_recv on the receiver's clock. The receiver's failure detector watches the gap t_recv_now - t_recv_last and trips when that gap exceeds a threshold T. Three things can make this trip incorrectly without any actual failure: (1) the network delivers the packet but the receiver's clock has jumped forward (NTP step, virtual machine pause, leap-second handling), so t_recv_now is artificially large; (2) the sender's clock has jumped backward, so the sender's heartbeat scheduling falls behind real time and the receiver sees a real gap that the sender did not intend; (3) the network really delayed the packet, but by less than T, while the receiver's clock simultaneously jumped forward, producing a combined gap that crosses the threshold. The detector cannot distinguish among these three causes because it only sees the gap, not the cause.

The T threshold is itself a clock measurement. When you tune Raft's election timeout to "150–300 ms", you are saying "150 ms as observed by the candidate's monotonic clock". If the candidate's monotonic clock drifts by 100 ppm (typical commodity quartz), 150 ms of intended timeout becomes 149.985 ms or 150.015 ms — usually irrelevant, but during a CPU temperature swing or a virtualised host's clock-source change, drift can briefly hit 1000 ppm and the actual interval becomes 149.85 ms or 150.15 ms. The absolute error is still small; what matters in a tight election race is that the bias is systematic: the follower whose clock runs fast has the shorter effective timeout and will start its election before the follower whose clock runs slow, every single time the cluster is under stress, not randomly as the protocol's randomised-timeout argument assumes.
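A back-of-envelope helper makes the drift arithmetic concrete. The function and its name are hypothetical, not part of any protocol implementation; the only fact it encodes is that drift of d ppm scales a measured interval by a factor of roughly 1 ± d/1,000,000:

```python
# Hypothetical helper: what a nominal timeout becomes under oscillator drift.
def effective_timeout_ms(nominal_ms: float, drift_ppm: float) -> float:
    """Real-time length of an interval the local timer believes is nominal_ms.
    A clock running fast (positive ppm) fires its timer early."""
    return nominal_ms * (1 - drift_ppm / 1_000_000)

for ppm in (100, 1000):
    early = effective_timeout_ms(150, +ppm)   # fast clock: timer trips sooner
    late = effective_timeout_ms(150, -ppm)    # slow clock: timer trips later
    print(f"{ppm:>4} ppm drift: {early:.3f} ms .. {late:.3f} ms")
```

At 1000 ppm the absolute error is only 0.15 ms; it is the consistent direction of the bias, not its size, that undermines the randomised-timeout argument.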

The deeper observation: the failure-models language of "the node failed at time T" is meaningless without specifying whose clock measured T. Two nodes that disagree about whether a third node failed are not necessarily disagreeing about the third node — they may be disagreeing about time, and the third node is the prop on which their clock-disagreement projects. Part 2 has been silent about this because Part 2 is about failure modes; the moment you try to implement a detector for those modes, time becomes the load-bearing wall.

Every failure-detection primitive is a clock measurement underneath:

- heartbeat timeout ("no probe in 3 intervals = suspect"): delta between two readings of the local monotonic clock
- request deadline ("abort if not done by 09:00:01.250"): absolute wall-clock value compared to NTP-synced now()
- leader lease ("I am leader until t = 09:00:05"): monotonic clock past lease end, plus a skew bound across replicas
- election timeout ("if no heartbeat in 150–300 ms, run"): a monotonic timer that did not jump or pause during a GC
- gossip suspicion ("phi-accrual exceeded 8.0"): distribution of inter-arrival deltas, which are clock-deltas again
Illustrative — every failure detector you ship is a clock measurement with a threshold. Part 3 is the chapter that asks what those measurements actually mean.

Why monotonic vs wall-clock matters here specifically: a wall clock can step backward when NTP corrects drift — the kernel can suddenly report 09:00:00.000 after just reporting 09:00:00.014, because NTP decided the local clock was 14 ms ahead of the canonical one and stepped it back. If your election-timeout code uses wall-clock subtraction, it can suddenly compute a negative elapsed time, which the conditional then treats as "no time has passed" — the timer never trips, and the failed leader stays "alive" from that follower's view until the wall-clock catches back up past the recorded timestamp. Monotonic clocks (Linux CLOCK_MONOTONIC, Go's time.Since, Python's time.monotonic) do not step backward, which is why every protocol implementation that has been debugged in production uses them for intervals and reserves wall-clock for timestamps that need to be compared across machines. The cost is that you cannot ask "is the lease still valid?" using monotonic time when the lease was issued by a different machine with its own monotonic origin — that is the gap Part 3 spends three chapters bridging.
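The distinction can be made concrete with standard-library calls only (the helper names are illustrative):

```python
import time

def elapsed_wall(start: float) -> float:
    # Wall-clock subtraction: can go negative after an NTP backward step,
    # and "if elapsed < timeout" then behaves as if no time has passed.
    return time.time() - start

def elapsed_mono(start: float) -> float:
    # CLOCK_MONOTONIC never steps backward: safe for intervals on one machine.
    return time.monotonic() - start

start_w, start_m = time.time(), time.monotonic()
time.sleep(0.05)
# On a healthy clock both deltas agree; only the monotonic one is guaranteed >= 0.
print(f"wall: {elapsed_wall(start_w):.3f}s  mono: {elapsed_mono(start_m):.3f}s")
```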

Demonstrating the failure: a short Python script that breaks under clock skew

The cleanest way to see the wall is to write the simplest possible failure detector, drive it with simulated heartbeats over a healthy network, then introduce clock skew and watch it produce wrong verdicts. The script below stands up a heartbeat-style detector, feeds it a heartbeat every 50 ms, then injects a 200 ms forward jump in the observing node's clock and shows the detector falsely declaring its peer dead.

# wall_clock_breaks_detector.py — a heartbeat detector that fails under clock skew
import time
from dataclasses import dataclass, field

@dataclass
class Detector:
    """Heartbeat-based detector: peer is suspect if no heartbeat in t_suspect seconds.
    Critically, this uses time.time() (wall-clock) rather than time.monotonic()."""
    name: str
    peer: str
    t_suspect: float = 0.150          # 150 ms suspect threshold
    last_heartbeat_at: float = 0.0
    clock_offset: float = 0.0         # injected skew in seconds
    log: list = field(default_factory=list)

    def now(self):
        # The bug: real systems often forget that wall-clock is not monotonic
        return time.time() + self.clock_offset

    def receive_heartbeat(self):
        self.last_heartbeat_at = self.now()

    def tick(self):
        gap = self.now() - self.last_heartbeat_at
        verdict = "SUSPECT" if gap > self.t_suspect else "ALIVE"
        self.log.append((round(gap * 1000, 1), verdict))
        return verdict

def simulate(skew_ms: float, ticks: int = 30, hb_period_ms: int = 50):
    d = Detector("n0", "n1")
    d.last_heartbeat_at = d.now()
    for i in range(ticks):
        # Heartbeat arrives every hb_period_ms (no network loss in this run)
        if i % (hb_period_ms // 10) == 0:
            d.receive_heartbeat()
        # Inject the forward clock jump once, halfway through the run
        if skew_ms and i == ticks // 2:
            d.clock_offset += skew_ms / 1000.0
            d.log.append(("CLOCK_JUMP", f"+{skew_ms} ms"))
        d.tick()
        time.sleep(0.010)  # 10 ms wall ticks
    return d.log

def window(log, size=8):
    """Show `size` entries starting at the clock-jump marker, or the tail."""
    for idx, entry in enumerate(log):
        if entry[0] == "CLOCK_JUMP":
            return log[idx:idx + size]
    return log[-size:]

if __name__ == "__main__":
    print("=== healthy run, no clock skew ===")
    for entry in window(simulate(skew_ms=0)):
        print(" ", entry)
    print("\n=== same run, +200 ms forward clock jump injected mid-stream ===")
    for entry in window(simulate(skew_ms=200)):
        print(" ", entry)

Sample run:

=== healthy run, no clock skew ===
  (20.3, 'ALIVE')
  (30.5, 'ALIVE')
  (40.6, 'ALIVE')
  (0.4, 'ALIVE')
  (10.5, 'ALIVE')
  (20.7, 'ALIVE')
  (30.6, 'ALIVE')
  (40.7, 'ALIVE')

=== same run, +200 ms forward clock jump injected mid-stream ===
  ('CLOCK_JUMP', '+200 ms')
  (200.8, 'SUSPECT')
  (210.6, 'SUSPECT')
  (220.4, 'SUSPECT')
  (230.5, 'SUSPECT')
  (240.3, 'SUSPECT')
  (0.7, 'ALIVE')
  (10.4, 'ALIVE')

Read the second block. The peer never failed. The peer is sending heartbeats every 50 ms exactly as before; the network is delivering them; the receive_heartbeat() method runs on schedule. The single thing that changed is that this node's wall-clock jumped forward by 200 ms — the kind of step a misconfigured NTP daemon, a virtualised host's clock-source switch, or a leap-second handler can produce in a single syscall. The detector's now() jumps with it, the gap computation sees 200 ms instead of <10 ms, and the detector's verdict flips to SUSPECT for several consecutive ticks. From the perspective of the cluster's membership protocol, this node has just announced that the leader is suspect. From the perspective of the leader, nothing has happened. The verdict is a property of this node's clock, not of the leader's behaviour.

Why a 200 ms skew is realistic, not a strawman: ntpd's default step threshold is 128 ms; offsets smaller than that are corrected by gradual slewing, while larger ones are applied as a single step, so a freshly started daemon that finds a 200 ms accumulated offset will step the clock in one syscall. More dramatically, Linux guests on KVM can experience sudden clock jumps when the hypervisor's clock source changes (e.g. a switch between TSC, HPET, and kvmclock at host migration), and jumps of 100–500 ms have appeared in field reports. Cassandra, Riak, and Elasticsearch operators have all written incident post-mortems describing clusters that fragmented during exactly such an event: the network was fine, the nodes were fine, the clocks disagreed and the cluster fragmented along the disagreement.

Why fixing this is harder than "use monotonic clocks": monotonic clocks solve the interval problem (how much time has passed on this machine), but heartbeats are sent from another machine and need to be ordered relative to each other across machines. The receiving side needs to ask "did the heartbeat for round 47 arrive before the heartbeat for round 48?", and that comparison crosses machine boundaries. If the sender stamps each heartbeat with its own monotonic clock, the receiver cannot interpret the value because monotonic clocks have arbitrary origins. If the sender uses wall-clock, you are back in the broken regime. The bridge is logical clocks (Lamport timestamps, vector clocks, hybrid logical clocks) — which is precisely the agenda for Part 3.
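The first rung of that bridge, a Lamport clock, fits in a dozen lines. This is an illustrative sketch under the standard textbook rules (tick on every event, take the max on receive), not the full Part 3 treatment:

```python
class LamportClock:
    """Logical clock: orders events by causality, with no wall-clock anywhere."""
    def __init__(self):
        self.t = 0

    def local_event(self) -> int:
        self.t += 1
        return self.t

    def stamp_send(self) -> int:
        # Tick, then stamp the outgoing message with the new value.
        self.t += 1
        return self.t

    def on_receive(self, msg_t: int) -> int:
        # Jump ahead of everything this process has causally seen.
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
stamp = a.stamp_send()          # heartbeat for one round, stamped 1
recv = b.on_receive(stamp)      # b's clock becomes max(0, 1) + 1 = 2
assert recv > stamp             # receive ordered after send, on any wall-clock
```

The receiver can now answer "did one heartbeat causally precede another?" by comparing stamps, regardless of what either machine's wall-clock reads.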

Time appears in places you do not expect

The detector example is the obvious case. The wall actually extends much further — almost every distributed-systems mechanism reaches for time somewhere, and most of those reaches are silent in the protocol description. Three categories of hidden time-dependence are worth pulling out before Part 3 starts.

Causal ordering pretends to be timeless but is not. "Event A happened before event B" sounds like a logical predicate, independent of clocks. It is not — every implementation of "happened-before" needs a clock to mark events with. Lamport timestamps use a logical clock, but that logical clock is incremented on every send and receive, and the implementation runs on a real machine whose instruction timing, memory coherence, and interrupt delivery are all driven by physical oscillators. The abstraction removes wall-clock from the protocol's logic, but it does not remove physical time from the substrate; it merely localises the dependence. This matters when debugging: a Lamport-timestamped log that shows event A's timestamp 7 and event B's timestamp 8 does not, by itself, tell you that A happened before B in real time — only that A's process knew about its own clock value 7 before some message from another process arrived that bumped the local clock to 8.

Idempotency keys have a TTL, and the TTL is wall-clock. Idempotency-based de-duplication — the standard answer to "what if a retry resends a payment?" — works by storing the idempotency key for a duration and checking it on each request. That duration is a wall-clock TTL: expires_at = now() + 24h. If two replicas of the de-dup store have skewed clocks by 30 minutes, a key that one replica has expired is still live on the other; a retry routed to the wrong replica may get processed twice. Real implementations (DynamoDB's idempotency table, Stripe's idempotency keys) sidestep this by centralising the de-dup store rather than replicating it across regions, but the moment you replicate de-dup state you re-introduce clock dependence.
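A toy model of the skewed-replica failure (the simulated clock, the class and key names, and the 30-minute skew are all illustrative, not any real store's API):

```python
# Two replicas of a de-dup store; expiry timestamps replicate verbatim,
# but each replica compares them against its own skewed wall-clock.
base_time = 0.0                     # simulated "true" wall time, in seconds

class DedupReplica:
    def __init__(self, clock_offset: float):
        self.offset = clock_offset  # this replica's clock skew
        self.expiries = {}          # idempotency key -> wall-clock expiry

    def now(self) -> float:
        return base_time + self.offset

    def store(self, key: str, expiry: float) -> None:
        self.expiries[key] = expiry

    def seen(self, key: str) -> bool:
        expiry = self.expiries.get(key)
        return expiry is not None and self.now() < expiry

a = DedupReplica(clock_offset=0.0)
b = DedupReplica(clock_offset=30 * 60)   # this replica runs 30 minutes fast
expiry = a.now() + 24 * 3600             # key written via replica a, 24 h TTL
a.store("pay-123", expiry)
b.store("pay-123", expiry)

base_time = 24 * 3600 - 60               # one minute before the true TTL
assert a.seen("pay-123")      # retry routed to a: duplicate suppressed
assert not b.seen("pay-123")  # retry routed to b: key "expired", payment runs twice
```

The key itself replicated perfectly; only the comparison against local now() diverged.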

Garbage collection of distributed state assumes time bounds. Tombstones in Cassandra, deletion markers in S3, vector-clock pruning in Riak — every distributed-storage system that supports deletion has to eventually remove the deletion record itself, otherwise tombstones accumulate forever. The standard mechanism is "after GC grace period of T (default 10 days), tombstones can be physically deleted". T is wall-clock. If a partition lasts longer than T and a node returns from the partition with state older than T ago, the system can resurrect deleted data — Cassandra's "zombie data" failure mode is exactly this. The protocol's correctness depends on the partition duration being shorter than T, which is a wall-clock claim that no failure model in Part 2 has the language to express.
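The grace-period arithmetic can be sketched in a few lines. This is a toy model; Cassandra's real mechanism runs through compaction and read repair, and the constant below only mirrors its default gc_grace_seconds:

```python
# Toy model: tombstone purge eligibility vs. partition duration (wall-seconds).
GC_GRACE = 10 * 24 * 3600           # mirrors Cassandra's 10-day default

def can_purge_tombstone(deleted_at: float, now: float) -> bool:
    # The protocol's safety claim: every replica has seen the delete by now.
    return now - deleted_at >= GC_GRACE

deleted_at = 1_000.0
partition_len = GC_GRACE + 3600     # a partition that outlasts the grace period
now = deleted_at + partition_len
assert can_purge_tombstone(deleted_at, now)
# Healthy replicas have purged the tombstone; a node returning from the
# partition still holds the pre-delete value, and repair copies it back:
# zombie data, because a wall-clock bound was exceeded.
```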

The pattern across all three: wherever a protocol claims to "eventually" do something, "eventually" is a wall-clock measurement in disguise. Eventual consistency means "consistent within some bounded time"; the bound is a wall-clock interval; the wall-clock is the thing Part 2 has been silent about. Part 3 makes the silence visible.

Three places where wall-clock dependence hides inside abstractions that look timeless:

- causal ordering: "A happened-before B" is logical, not wall-time; underneath, the logical clock advances on physical hardware whose tick rate is wall-clock
- idempotency keys: de-duplicating retries looks pure until the key expires; underneath, TTL = expires_at = now() + 24h, and replica clock skew makes the key expire inconsistently
- distributed GC: tombstones and deletion markers need bounded retention; underneath, a GC grace period measured in wall-clock days, and a partition longer than the grace period resurrects zombie data
Illustrative — three abstractions that present as time-independent but break when wall-clock skew is introduced. Part 3's logical-clock arsenal exists because of these silent dependencies.

A production incident — CricStream's leader-lease miscount, July 2024

CricStream runs its session-routing layer on a 5-node Raft cluster behind a Layer-7 load balancer. Each leader holds a 5-second lease over the routing-table partition; the lease lets the leader serve reads without a quorum round-trip. The lease is "valid until t = leader_clock_at_grant + 5000 ms", computed using each follower's local clock. Followers refuse to elect a new leader until their own clock reads past t, on the assumption that even with skew, the lease is over by the time every follower's clock has crossed t.

On 12 July 2024 at 19:15 IST — eight minutes before the IPL Eliminator final's first ball — a host-level kernel upgrade on three of the five Raft nodes finished its scheduled rolling reboot. The upgrade had inadvertently swapped the systemd-timesyncd daemon for chronyd; chronyd, on first start, queried the upstream pool and applied a step correction of +180 ms to the host clock (the previous timesyncd had been slowly slewing for 36 hours and the offset had grown). On the three rebooted nodes, the wall-clock jumped forward by 180 ms in a single syscall. On the two non-rebooted nodes, the clock did not jump.

The leader at the time was one of the non-rebooted nodes. It granted a 5-second lease at t_grant_leader = 19:23:11.000 (its own clock). The three rebooted followers received the lease grant and stored expiry = 19:23:16.000 (their interpretation). At wall-time 19:23:13.820 — just under three seconds into the lease — the rebooted followers' clocks read 19:23:14.000 (180 ms ahead). At wall-time 19:23:15.820 they read 19:23:16.000. The followers' code checked "is now() >= expiry?", got true 180 ms early, and a follower triggered an election thinking the lease had expired. The leader, whose clock had not jumped, was still happily serving reads under what it thought was a valid lease. For 180 ms the cluster had two leaders.

During those 180 ms, the load balancer routed two writes for two different user sessions to the two leaders simultaneously, both leaders accepted them, both leaders attempted replication, and the replication conflict resolved when the new leader (with the higher term) ultimately won — discarding 14 in-flight writes from the old leader. The 14 writes corresponded to 14 viewer-session updates: 12 of them were idle session-pings and lost-without-impact; 2 were the assignment of premium-quality stream URLs to two viewers in Hyderabad whose sessions ended up routed to a stale URL. Both viewers saw a 4-second buffering pause when the URL became invalid 7 seconds later.

The post-incident root cause was filed as "rolling kernel upgrade triggered NTP-daemon swap with step correction; lease-expiry calculation does not account for cross-node clock-step skew". The fix was twofold: (1) extend lease grants by an additional max_expected_skew_ms (set to 250 ms for this fleet) so a follower considers a lease expired only after expiry + max_skew_ms of its own clock, and (2) move the comparison from wall-clock to a hybrid-logical-clock value where the leader's grant carries the leader's HLC and the follower advances its HLC monotonically on receipt. The hybrid-logical-clock fix is exactly what Part 3, chapter 17 ("Hybrid logical clocks (HLC)") is about.
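The first half of that fix is a one-line change to the follower's expiry check. A sketch using the incident's numbers (function names are illustrative; 250 ms is the fleet setting quoted in the post-mortem):

```python
MAX_SKEW_MS = 250.0   # max_expected_skew_ms from the post-mortem

def lease_expired_naive(now_ms: float, expiry_ms: float) -> bool:
    # Trips early when this follower's clock has stepped forward.
    return now_ms >= expiry_ms

def lease_expired_skew_aware(now_ms: float, expiry_ms: float) -> bool:
    # A follower treats the lease as expired only once its own clock is
    # past expiry by more than the worst skew it budgets for.
    return now_ms >= expiry_ms + MAX_SKEW_MS

expiry_ms = 16_000.0        # lease end, i.e. 19:23:16.000 in ms-of-minute
follower_now_ms = 16_000.0  # follower's stepped clock reads this 180 ms early

assert lease_expired_naive(follower_now_ms, expiry_ms)           # early election
assert not lease_expired_skew_aware(follower_now_ms, expiry_ms)  # lease honoured
```

The margin buys safety at the cost of availability: every legitimate failover now waits an extra MAX_SKEW_MS, which is why the second half of the fix moves to hybrid logical clocks instead of widening the margin further.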

This incident is in the catalogue of every distributed-systems-team's cautionary stories because it has the structural property that no individual component failed. The kernel upgrade was applied correctly. NTP daemons did their job correctly. Raft's election logic was correct. The lease arithmetic was correct given the wall-clock as input. The system fragmented because the wall-clock is not a reliable input — the wall this chapter names.

Going deeper

Why Part 2 deliberately avoided time until now

A pedagogical choice runs through this curriculum: Part 1 introduces the forces that make distributed systems necessary (scale, physics, availability), Part 2 introduces the failure surface (partial failures, network failure modes, slow nodes), and Part 3 introduces time and order. The order is not arbitrary. Part 2's failure-mode taxonomy is structurally complete on its own — it identifies what can break — and explicitly refuses to address how you detect what broke. That refusal is deliberate, because every detector is a clock, and clocks are a coherent topic that deserves its own part rather than being smeared across failure-detection examples. By the end of Part 2 you should have an uncomfortable feeling: I now know all the ways the system can fail, but I do not know how I would tell. That feeling is the wall. Part 3 is the demolition.

The Chubby paper's line about clocks

Mike Burrows's 2006 OSDI paper on Chubby — Google's distributed lock service — has a single sentence in §2.4 that frames the time problem more directly than any of the textbook treatments: "Clients must extend their leases before they expire, but the leader uses local time to decide when leases have expired." The sentence is buried in a paragraph about lease management, but it carries the entire weight of this chapter. The leader's notion of "the lease has expired" is determined by its clock; the client's notion of "I should renew now" is determined by its clock. If the two clocks disagree, the renewal can arrive after the leader has already considered the lease expired — and at that moment, the leader believes the client is dead, while the client believes itself alive. Chubby's specific defence is to add a grace period to lease grants and to refuse to grant a lease less than that grace period long, but the deeper point is that even a service whose primary product is "agreement on time-bounded leases" has to build the agreement against unreliable clocks. The protocol is not made-up safety theatre; it is the load-bearing structure that lets a useful service exist on top of clocks that lie.

What FLP impossibility says about time, exactly

The Fischer-Lynch-Paterson impossibility result (1985) says: in an asynchronous network where one process can crash, no deterministic consensus protocol can guarantee termination. The "asynchronous" qualifier is the key — FLP's network has no timing guarantees at all; messages can be delayed by an arbitrarily long but finite amount. In this model, you cannot tell apart "the message is in flight, slow" from "the sender crashed". Real consensus protocols (Paxos, Raft) circumvent FLP not by violating its proof — the proof is correct — but by assuming a partial-synchrony model: messages are delivered within some bound Δ eventually, even if Δ is unknown initially. Under partial synchrony, randomised timeouts plus a high-enough T recover liveness; the cluster may take longer to elect a leader during a temporary asynchrony spike, but it does eventually elect one. The price of liveness is exactly this clock-based detector — and the FLP result is the formal statement that without it, you do not get consensus at all. Part 8 (consensus) returns to this in detail; for now, the takeaway is that the wall this chapter names is the same wall that the deepest impossibility result in the field names. It is not a curriculum-organisation choice; it is the physical structure of the problem.

Reproduce this on your laptop

# Reproduce the wall-clock detector breakage
python3 -m venv .venv && source .venv/bin/activate
python3 wall_clock_breaks_detector.py
# Edit the skew_ms value to find your detector's tipping point
# Try 50, 100, 150, 200; observe at what skew the verdict flips

# See your own machine's clock-step rate via NTP queries
chronyc tracking          # if you run chrony
ntpq -p                   # if you run ntpd
# Both report the current offset between the local clock and its upstream sources

# Demonstrate the monotonic-vs-wall difference directly
python3 -c "import time; w0=time.time(); m0=time.monotonic(); time.sleep(1); print(f'wall delta: {time.time()-w0:.6f}, mono delta: {time.monotonic()-m0:.6f}')"

Where this leads next

Part 3 picks up the gauntlet this chapter throws down. Each chapter in Part 3 is a specific weapon for a specific clock failure mode.

The short version of Part 3's promise: by the end of it, you can write a failure detector that does not lie when one node's clock jumps, a lease that does not split-brain when followers and leader disagree about wall-time, and a causal-ordering scheme that survives a six-hour partition. None of this is possible inside Part 2's vocabulary, which is why Part 2 ends here. The wall is real; Part 3 is what gets you over it.

References

  1. Time, Clocks, and the Ordering of Events in a Distributed System — Lamport, CACM 1978. The foundational paper that makes "happened-before" precise without wall-clock dependence.
  2. The Chubby Lock Service for Loosely-Coupled Distributed Systems — Burrows, OSDI 2006. The lease-management paper whose §2.4 quote on clocks captures the wall this chapter names.
  3. Spanner: Google's Globally-Distributed Database — Corbett et al., OSDI 2012. The TrueTime architecture and its bounded-uncertainty interval.
  4. Impossibility of Distributed Consensus with One Faulty Process — Fischer, Lynch, Paterson, JACM 1985. The asynchrony-vs-consensus result that frames why timeouts exist at all.
  5. Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases — Kulkarni et al., 2014. The hybrid-logical-clock formalisation used by CockroachDB and YugabyteDB.
  6. How real networks actually fail (studies) — internal cross-link to the Part 2 chapter that catalogues the network-side failures this chapter pairs with on the time-side.
  7. Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8 ("The Trouble with Distributed Systems") synthesises clocks, networks, and partial failures for a practitioner audience.
  8. Hybrid Logical Clocks in CockroachDB — Cockroach Labs engineering blog. The deployed-system view of why HLC is the practical answer when TrueTime is not available.