Wall clocks and NTP

At 17:17 IST (11:47 UTC) on a Tuesday in March, Aditi at PaySetu opens a Jupyter notebook to investigate why three audit-log entries from the settlement-ledger carry timestamps 2026-03-17T11:47:03.412Z, 2026-03-17T11:47:03.317Z, 2026-03-17T11:47:03.398Z — in that order on disk, yet not in increasing time order. Same host, same process, same single-threaded writer. She runs chronyc tracking and gets Last offset: -0.094521 seconds. Her host's clock has just been stepped backwards by 94 ms, and the writes that straddled the step carry wall-clock timestamps that decrease. Nothing is broken. Everything is working as designed. The audit-log timestamps are still nominally "correct" by the OS, by NTP, by every well-behaved Linux daemon. They just don't go forward.

A wall clock is not a clock — it is a high-frequency hardware counter (TSC, HPET, kvm-clock) whose rate is approximately known, scaled and added to an offset that NTP keeps nudging toward a remote server's notion of UTC. NTP guarantees neither monotonicity nor accuracy bounds; it slews when the offset is small, steps when it is large, and at every step a measurable interval of time either repeats or vanishes. Every distributed-systems protocol that uses time.time() for ordering is making a bet that no slew or step occurs during the operation it cares about. Often it is a safe bet. Sometimes it is not.

What the kernel actually returns when you call time.time()

The Python statement time.time() looks atomic. It is not. On Linux the call resolves through the C library to clock_gettime(CLOCK_REALTIME, &ts), which the kernel implements by reading a hardware counter (typically the TSC — the CPU's Time Stamp Counter, a 64-bit register that increments every CPU cycle), scaling that count by a per-CPU calibration constant to get nanoseconds-since-boot, and adding a kernel-maintained wall-clock baseline (tracked alongside the kernel's wall_to_monotonic offset) to convert that into nanoseconds-since-the-Unix-epoch. The wall clock is derived, not measured directly. There is no oscillator inside your machine that ticks once per UTC nanosecond — there is a quartz crystal nominally at 100 MHz with a manufacturer tolerance of 50 parts per million, and a software story on top of it that pretends the result is UTC.

The hardware counter the kernel uses is selected at boot time from an ordered preference list. On most modern x86 hosts the list reads: TSC (preferred — fast, per-CPU, ~1 ns resolution), HPET (High Precision Event Timer, a chipset-level counter at ~14 MHz, slower to read), ACPI PM timer (3.58 MHz, very slow), and on virtualised hosts kvm-clock or pvclock (a paravirtualised view that the hypervisor publishes through a shared memory region). You can see your host's choice with cat /sys/devices/system/clocksource/clocksource0/current_clocksource. The choice matters because each source has different drift characteristics, different read costs (TSC is ~5 ns; HPET is ~600 ns), and different behaviours under sleep/migrate/suspend. A process that does 10 million time.time() calls during a tight loop will spend ~50 ms on TSC and ~6 seconds on HPET — a 120× slowdown invisible to the application but visible in a CPU flame graph.
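
A rough way to see the read-cost difference from user space is to time a burst of clock reads. A minimal sketch, with the caveat that the Python interpreter adds tens of nanoseconds of call overhead on top of the raw vDSO read, so the absolute numbers are inflated even though the TSC-vs-HPET gap still shows through:

# clock_read_cost.py — rough per-call cost of reading the wall clock on this host (illustrative)
import time

def read_cost_ns(n: int = 1_000_000) -> float:
    """Average cost of one time.time() call, measured against the monotonic clock."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        time.time()
    t1 = time.perf_counter_ns()
    return (t1 - t0) / n

if __name__ == "__main__":
    # Which hardware counter the kernel chose (Linux only)
    with open("/sys/devices/system/clocksource/clocksource0/current_clocksource") as f:
        print("clocksource:", f.read().strip())
    # Includes Python call overhead (tens of ns), so compare runs against each other,
    # not against the raw ~5 ns / ~600 ns hardware figures quoted above
    print(f"~{read_cost_ns():.0f} ns per time.time() call")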

The conversion from the hardware counter to wall-clock seconds is the load-bearing fiction. The kernel maintains two values it updates roughly every 1 ms (the tick): the scale factor that converts counter ticks to nanoseconds (the mult and shift pair), and the wall-clock baseline. Every clock_gettime(CLOCK_REALTIME) call reads the counter, computes (ticks × mult) >> shift to get nanoseconds, and adds the baseline. NTP's job is to nudge the mult factor up or down so that the kernel's nanoseconds-per-tick estimate stays close to a "real" nanosecond, and to occasionally edit the baseline directly if the offset gets too large to slew out. The wall clock you see in the audit log is ((hardware_ticks × mult) >> shift) + baseline + slew_correction_in_progress, and every term in that expression can move while you are reading it.
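
A sketch of that expression in code. The real mult, shift, and baseline live in the kernel's timekeeper and change constantly; the values below are made up, chosen so that mult >> shift comes out to roughly 1/3 ns per tick — about what a 3 GHz TSC would need:

# mult_shift_sketch.py — illustrative model of the kernel's ticks → wall-clock conversion
MULT = 5_592_405            # hypothetical scale factor: 5_592_405 / 2**24 ≈ 0.333 ns per tick
SHIFT = 24                  # hypothetical shift; the kernel picks both when registering a clocksource
BASELINE_NS = 1_774_000_000_000_000_000   # hypothetical baseline, ns since the Unix epoch

def wall_clock_ns(hardware_ticks: int) -> int:
    """What CLOCK_REALTIME conceptually returns: scaled counter plus NTP-maintained baseline."""
    return ((hardware_ticks * MULT) >> SHIFT) + BASELINE_NS

# NTP slewing nudges MULT up or down by a few ppm; a step edits BASELINE_NS directly.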

Figure: the five layers between a quartz oscillator and what time.time() returns — a vertical stack, bottom to top: quartz crystal (~100 MHz nominal, 50 ppm tolerance, drifts with temperature) → hardware counter (TSC / HPET / kvm-clock; monotonic ticks, per-CPU on TSC, can stop on suspend) → kernel scaling (ticks × mult >> shift → nanoseconds, recalibrated by NTP / PTP every tick) → NTP-maintained offset (wall-clock baseline; slewed when |offset| < 128 ms, stepped when larger) → clock_gettime(CLOCK_REALTIME) → time.time(), the sum of everything below it. Each layer adds uncertainty.
Illustrative — the wall clock is not a measurement; it is a stack of approximations. Every distributed-systems bug involving time lives at one of these layer boundaries.

Why TSC was historically dangerous and why it is now mostly fine: pre-2008 multi-socket x86 machines had per-CPU TSCs that ran independently and could drift relative to each other by tens of microseconds per second. A process migrated from CPU 0 to CPU 1 mid-syscall could observe time going backwards by hundreds of microseconds. Intel introduced "invariant TSC" with Nehalem in 2008 — the TSC ticks at a fixed rate regardless of frequency and power-state changes, and the platform keeps the sockets' counters synchronised — and modern Linux kernels detect this via the constant_tsc and nonstop_tsc CPU flags. If your host has both flags (check grep tsc /proc/cpuinfo), TSC is monotonic and uniform across cores; if it has only one or neither, the kernel should fall back to HPET, but in cloud / VM environments this fallback decision can flip across host migrations and produce sudden ~600 ns vs ~5 ns read-cost changes that look like application latency regressions.

NTP — what it actually does, and what it does not

NTP (Network Time Protocol) is what most Linux hosts use to keep the wall-clock baseline anchored to a remote source of UTC. The model is simple in shape and full of caveats in detail. The local NTP daemon (ntpd, chronyd, or systemd-timesyncd) periodically sends a packet to one or more upstream servers asking "what time is it?", measures the round-trip time of the response, estimates the one-way delay as half the round-trip, and computes the local-vs-remote offset. It then issues a correction to the kernel via adjtimex(2) that either slews the clock (gradually adjust the mult factor so the local clock catches up over the next several seconds) or steps it (atomically change the baseline by the offset amount).
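
The offset and delay arithmetic is small enough to fit in a few lines. A sketch of the textbook calculation from RFC 5905, using the four timestamps the client collects (t1 client send, t2 server receive, t3 server send, t4 client receive):

# ntp_offset_sketch.py — the RFC 5905 offset and delay calculation (illustrative)
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    offset = ((t2 - t1) + (t3 - t4)) / 2   # assumes outbound and return paths take equal time
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing time
    return offset, delay

# A 20 ms path asymmetry is invisible to this formula and lands directly in `offset`
# as a ~10 ms error — the failure mode described under "Asymmetric NTP path" below.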

The slew-vs-step decision is governed by a configuration parameter (tinker step in ntpd, makestep in chronyd). The default in modern installations is to slew if the offset is below 128 ms and to step if it is above. Slewing is gentle: it never makes the wall clock go backwards (the kernel guarantees this for adjtimex-driven slews — the mult factor is reduced, not the baseline), and it never produces an observable jump. But slewing has a rate limit of typically 500 ppm — so a 100 ms offset takes 200 seconds to slew out. During those 200 seconds, every time.time() reading on this host disagrees with every other host's reading by a measurable, decreasing amount. Stepping, by contrast, is instantaneous and can move the clock backwards: if your local clock is 200 ms ahead of the upstream source, a step subtracts 200 ms from the baseline, and the next time.time() reading is 200 ms earlier than the previous one.
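
The 200-second figure falls straight out of the slew rate. A small sketch of the budget arithmetic — 500 ppm is ntpd's classic limit; chronyd's default maxslewrate is far higher, so its slews finish much sooner:

# slew_budget.py — how long an offset takes to slew out at a given maximum rate (illustrative)
def slew_seconds(offset_s: float, max_rate_ppm: float = 500.0) -> float:
    """Time for the daemon to absorb offset_s by running the clock fast or slow at max_rate_ppm."""
    return abs(offset_s) / (max_rate_ppm / 1e6)

if __name__ == "__main__":
    for off_ms in (5, 50, 100, 500):
        print(f"{off_ms:>4} ms offset → {slew_seconds(off_ms / 1000):>5.0f} s of slewing at 500 ppm")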

NTP's accuracy guarantees are weak in a way most engineers do not appreciate. The protocol's stated accuracy is "tens of milliseconds across a public-internet connection" — a target, not a guarantee. The actual offset bound depends on (1) the round-trip time symmetry to the upstream server (NTP estimates one-way delay as half the RTT, which is wrong if the network is asymmetric), (2) the upstream server's own accuracy (Stratum 1 servers run on GPS receivers or atomic clocks; Stratum 2 are themselves NTP clients of Stratum 1; each layer adds uncertainty), and (3) the local daemon's smoothing-vs-responsiveness trade-off (a fast-responding daemon is jittery; a smooth one drifts). When PaySetu's settlement-ledger configures pool 0.in.pool.ntp.org, the host is talking to a Stratum 2 server in India whose own RTT to Stratum 1 might be 30 ms with 20% asymmetry, so the upstream offset is ±6 ms before this host's NTP daemon has even calculated anything. Layer the local network's jitter (typical 1–5 ms) and the daemon's smoothing (typical 1–10 ms slow to respond), and the practical accuracy on a healthy LAN is 5–20 ms, not the "millisecond-level" claim that gets quoted in marketing.

What NTP does not do is at least as important. It does not guarantee monotonicity (steps are allowed and routine on daemon startup). It does not quantify its own uncertainty (you can read chronyc tracking to get the current estimated offset, but there is no API saying "the wall clock is between 09:23:11.005 and 09:23:11.013" the way Spanner's TrueTime does). It does not detect malicious upstream servers (NTP authentication exists but is rarely deployed). It does not survive an extended network partition: once its sources have been unreachable long enough that their estimated error exceeds the daemon's maxdistance limit, the daemon stops trusting them, and the local clock free-runs at the full quartz rate (50 ppm ≈ 4.3 seconds per day). A host that has been partitioned from its NTP servers for 24 hours can be ~4 seconds off UTC and has no way to tell you.
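
The free-running figure is the same kind of arithmetic in the other direction — a sketch of how far a 50 ppm crystal can wander once the daemon stops correcting it:

# drift_budget.py — worst-case free-running drift at a given quartz tolerance (illustrative)
def max_drift_seconds(tolerance_ppm: float, elapsed_s: float) -> float:
    return tolerance_ppm * 1e-6 * elapsed_s

if __name__ == "__main__":
    for hours in (1, 8, 24):
        drift = max_drift_seconds(50, hours * 3600)
        print(f"{hours:>2} h unsynchronised at 50 ppm → up to {drift:.2f} s off UTC")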

Figure: NTP slew vs step behaviour over time — offset from UTC (y-axis) against wall-clock time (x-axis). Free-running drift accumulates to about +180 ms over four hours; a daemon restart then applies a makestep correction, stepping the clock back to zero in a single discontinuity (marked STEP), after which the slow drift and slewing resume.
Illustrative — under default chronyd settings the clock drifts freely up to ~180 ms, then a daemon restart triggers a step correction. Any audit-log timestamp written across the dashed line is non-monotonic.

Measuring your wall clock's behaviour with a 30-line probe

The clearest way to internalise wall-clock behaviour is to measure it. The script below samples both CLOCK_REALTIME and CLOCK_MONOTONIC at 100 Hz for 30 seconds, computes per-sample deltas, and flags any sample where the wall clock either went backwards or diverged from the monotonic clock by more than 5 ms in a single tick. On a healthy laptop with a quiet NTP daemon you will see no flags. On a host that just had chronyc makestep invoked, or a VM that just resumed from suspend, or a container whose clock source was just remapped by the hypervisor, you will see exactly the flags you expected never to see.

# probe_wall_clock.py — 100 Hz wall-vs-monotonic sampler with anomaly flagging
import time, statistics

def probe(duration_s: float = 30.0, hz: int = 100):
    period = 1.0 / hz
    samples = []
    last_wall = time.time()
    last_mono = time.monotonic()
    deadline = last_mono + duration_s
    while time.monotonic() < deadline:
        now_wall = time.time()
        now_mono = time.monotonic()
        d_wall = now_wall - last_wall
        d_mono = now_mono - last_mono
        skew = d_wall - d_mono       # positive: wall ran fast; negative: wall ran slow
        flag = ""
        if d_wall < 0:
            flag = "BACKWARDS"
        elif abs(d_wall - d_mono) > 0.005:
            flag = "JUMP"
        samples.append((round(d_wall * 1000, 3), round(d_mono * 1000, 3),
                        round(skew * 1e6, 1), flag))
        last_wall, last_mono = now_wall, now_mono
        time.sleep(period)
    return samples

if __name__ == "__main__":
    s = probe()
    flagged = [x for x in s if x[3]]
    skews_us = [x[2] for x in s]
    # p50 shows the quiet-state noise floor; the worst |skew| surfaces a step if one occurred
    worst_us = max(abs(v) for v in skews_us)
    print(f"samples: {len(s)}  flagged: {len(flagged)}")
    print(f"skew p50 (us): {statistics.median(skews_us):.1f}   worst |skew| (us): {worst_us:.1f}")
    for flag_event in flagged[:5]:
        print("  flag:", flag_event)

Sample run on a quiet host:

samples: 3000  flagged: 0
skew p50 (us): 0.4   worst |skew| (us): 12.7

Sample run on the same host, with sudo chronyc makestep invoked at t=15 s:

samples: 3000  flagged: 1
skew p50 (us): 0.4   worst |skew| (us): 96212.3
  flag: (-86.214, 9.998, -96212.3, 'BACKWARDS')

Read the second output. The worst single-sample skew is 96 milliseconds — nearly four orders of magnitude above the quiet-host figure — because one sample crossed the step. The flagged sample shows the wall clock moving by -86.214 ms while the monotonic clock advanced by the expected 9.998 ms. Any audit-log entry written by another process during that 10 ms tick gets a timestamp 86 ms earlier than the entry written 10 ms before. The wall clock disagreed with itself.

Why the script uses both clocks: time.time() alone cannot detect a step, because after the step you only see the new value. By comparing against time.monotonic() — which the kernel guarantees is non-decreasing and unaffected by NTP slews or steps — you have a baseline that says "10 ms of wall-time should have passed". Any disagreement between the two is a wall-clock anomaly. The same dual-reading technique shows up in production systems that audit their own timestamps: write both the wall-clock timestamp and a monotonic-derived reading with every record, and have an out-of-band auditor flag records where the two disagree by more than a configurable threshold.
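
A minimal sketch of that dual-timestamp pattern — the Record fields, the 5 ms threshold, and the audit pass are illustrative assumptions, not any particular system's schema:

# dual_stamp.py — write wall-clock and monotonic readings together, audit their disagreement
import time
from dataclasses import dataclass

@dataclass
class Record:
    payload: str
    wall_ts: float      # time.time() at write — what downstream consumers see
    mono_ts: float      # time.monotonic() at write — only comparable on the writing host

def write(payload: str) -> Record:
    return Record(payload, time.time(), time.monotonic())

def audit(records: list[Record], threshold_s: float = 0.005) -> list[int]:
    """Flag indices where wall-clock elapsed time disagrees with monotonic elapsed time."""
    flagged = []
    for i in range(1, len(records)):
        d_wall = records[i].wall_ts - records[i - 1].wall_ts
        d_mono = records[i].mono_ts - records[i - 1].mono_ts
        if abs(d_wall - d_mono) > threshold_s:
            flagged.append(i)
    return flagged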

Why 100 Hz sampling matters: at 1 Hz you would catch the step (it is 86 ms — much larger than one second's worth of quartz drift) but you would not see its shape. At 100 Hz you can see that the anomaly is concentrated in a single 10 ms window, which tells you it was a step correction, not a slew. If the script showed 50 consecutive samples each running 2 ms slow, that would be a slew — gentler, harder to detect, and (importantly) guaranteed never to run the wall clock backwards. The two anomalies have very different consequences for a distributed protocol, and you cannot tell them apart at the 1-second scale that most monitoring dashboards use.

A production incident — RailWala's slot-allocation rollback, October 2025

RailWala runs the Tatkal-hour booking system on a 7-node fleet behind a stateless API gateway. Each booking write goes to a primary node which records the slot allocation with a wall-clock timestamp; a downstream reconciliation job sweeps the previous 60 seconds of writes every 5 seconds, rolling back any slots that are duplicated (defined as: same train, same coach, same seat-number, two writes within 200 ms of each other — a heuristic that catches the rare double-allocation race). The system has run for two years without incident. On 14 October 2025 at 10:00 IST — the start of the Tatkal window for one of the busier Diwali routes — the reconciliation job rolled back 3,847 legitimate bookings in a single 5-second sweep.

The cause was a clock step. Three of the seven primary nodes had been migrated overnight to a new VM hypervisor whose default NTP configuration was systemd-timesyncd rather than the previous chronyd. systemd-timesyncd's default behaviour on first start is to apply a single step correction of whatever offset it computes against the upstream pool. On 14 October the three migrated nodes booted at 09:58 IST and, on timesyncd's first upstream query, stepped their clocks forward by 240 ms — the offset between the clock the migrated guests had inherited and the pool's notion of UTC. At 10:00 IST the bookings started; on the three stepped nodes, every write carried a timestamp 240 ms ahead of the timestamp the four un-migrated nodes would have assigned at the same instant. On those four nodes, still slewing quietly under chronyd, nothing had changed.

The reconciliation job's "two writes within 200 ms" heuristic now found a flood of false duplicates. The heuristic relies on legitimate repeat writes for the same slot (a retry routed to a different primary, for instance) being separated by comfortably more than 200 ms of timestamp distance. With three nodes stamping 240 ms ahead of the other four, that separation collapsed: a write for slot S1 from stepped node-3 at true wall-clock 10:00:01.000 carried timestamp 10:00:01.240, and a second legitimate write for the same slot from un-stepped node-1, arriving 300 ms later at true wall-clock 10:00:01.300, carried timestamp 10:00:01.300. The true gap was 300 ms — outside the race window the heuristic was built to catch — but the stamped gap was 60 ms, so the pair was classified as a double-allocation and the later write rolled back. Repeated across a Tatkal-scale write rate, the job rolled back 3,847 bookings, issued ₹4.8 lakh in refunds (₹125 average ticket price × 3,847), and produced a customer-experience event large enough to land in a Mumbai newspaper's tech column the next day.

The post-incident root cause was filed as "NTP daemon swap on hypervisor migration applied step correction immediately before peak booking window; reconciliation logic does not account for cross-node clock-step skew exceeding heuristic's 200 ms tolerance". The fix had three parts: (1) replace the wall-clock-based "within 200 ms" heuristic with a logical ordering using monotonic-derived sequence numbers per primary, (2) configure every NTP daemon in the fleet to slew only — for chronyd, drop the makestep directive entirely, so no daemon-restart event can produce a sudden timestamp jump, at the cost of large offsets taking minutes to slew out — and (3) add a fleet-wide alarm on chronyc tracking | grep "Last offset" that fires if any node's offset exceeds 50 ms, well below the heuristic's tolerance. The third fix is the most important — the engineering team learned that they had been running clock-dependent business logic for two years without monitoring the clocks themselves.
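
A minimal sketch of fix (1) — per-primary sequence numbers as the ordering key instead of wall-clock timestamps; the names are illustrative assumptions, not RailWala's actual code:

# seq_ordering.py — order writes by (primary_id, per-primary counter) rather than wall clock
import itertools
from dataclasses import dataclass, field

@dataclass
class Primary:
    node_id: str
    _seq: itertools.count = field(default_factory=itertools.count)

    def next_write_key(self) -> tuple[str, int]:
        """Monotone per-primary key; immune to NTP slews and steps on this host."""
        return (self.node_id, next(self._seq))

# Duplicate detection then reasons about sequence distance within a primary plus an explicit
# cross-primary rule, instead of trusting that seven hosts' wall clocks agree within 200 ms.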

This incident is a cleaner instance of the pattern from Part 2's wall chapter: no individual component failed. The NTP daemons did exactly what they are documented to do. The reconciliation job did exactly what it was designed to do. The hypervisor migration was applied correctly. The system fragmented because two cohorts of nodes disagreed about wall-clock time by 240 ms during a 5-minute window, and a piece of business logic was implicitly trusting that all seven nodes' clocks agreed within 200 ms — a trust that had no monitoring, no alarm, and no documented bound.

Failure modes you will see in production

Wall-clock-related failures cluster into a small number of recurring shapes. Pattern-match against this list when triaging an incident; in field experience these account for the large majority of clock-driven outages.

Hypervisor-induced step on guest start. A VM that has been suspended (host migration, snapshot restore, EBS volume reattach) sees its wall clock paused at the suspension instant and resumed at the resumption instant — but the upstream world has moved forward by the wall-clock duration of the pause. On guest resume, the NTP daemon detects a large offset and applies a step. If the guest started serving traffic before the step landed, every request handled in that window carries a wall-clock timestamp from before the step; when the step lands, the clock jumps forward past all of them. The signature: a cluster of timestamps bunched around the suspension instant, followed by a forward skip.

Asymmetric NTP path producing a one-sided slew. NTP estimates one-way delay as half the round-trip. If the network path is asymmetric — typical when one direction goes through a longer route, or when the upstream server's response queue has higher latency than the request queue — the offset estimate is wrong by half the asymmetry. A 20 ms asymmetry produces a 10 ms slew toward the wrong side, which the daemon happily applies. When the asymmetry resolves (a route change, a server's queue draining), the offset reverses and a new slew goes the other way. The signature: the host's Last offset oscillates between roughly opposite values without ever centring on zero.

Container clock-source disagreement. A Docker container running on a host whose clock source is kvm-clock inherits that source — unless the container's syscalls go through vDSO (the kernel's user-space accelerator for clock_gettime), which on some host configurations falls back to a different source. Two containers on the same host can read their clocks via different paths and observe ~600 ns of disagreement per call, which is invisible to most software but matters for trace-span ordering inside a single host's worth of microservices.

Daemon swap on package upgrade. As in the RailWala incident: a Linux distribution upgrade that replaces ntpd with chronyd (or chronyd with systemd-timesyncd) removes the running daemon's accumulated offset history and starts fresh. The new daemon's first action is to query upstream and apply a step correction of however much the local clock has drifted in the seconds between old daemon stopping and new daemon starting. Even a 90-second gap can produce a 5 ms step on a heavily-loaded host.

Why these four are the common case rather than "leap seconds": leap seconds happen at most twice a year by IERS announcement, and the last one was inserted on 2016-12-31. Hypervisor pauses, asymmetric NTP, container vDSO mismatches, and daemon-swap step-corrections happen weekly to monthly in any non-trivial fleet. The leap-second outage is famous, but operationally the recurring pattern is one of these four — and none of them get covered in the textbook treatment of "NTP". They are operational pathologies of the implementation, not the protocol.
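
The hypervisor-pause case is the one you can cheaply detect from inside the guest, because Linux exposes two monotonic clocks that differ only in whether they count suspended time. A sketch (Linux-only; whether a given hypervisor's pause shows up in CLOCK_BOOTTIME depends on how it virtualises the clocks, so treat this as a best-effort signal):

# suspend_probe.py — detect suspend/pause gaps by comparing CLOCK_BOOTTIME and CLOCK_MONOTONIC
import time

def suspended_seconds() -> float:
    """CLOCK_BOOTTIME counts time spent suspended; CLOCK_MONOTONIC does not."""
    return time.clock_gettime(time.CLOCK_BOOTTIME) - time.clock_gettime(time.CLOCK_MONOTONIC)

if __name__ == "__main__":
    baseline = suspended_seconds()
    while True:
        time.sleep(1.0)
        gap = suspended_seconds() - baseline
        if gap > 0.1:   # assumed threshold: a pause this long usually precedes an NTP correction
            print(f"suspend/pause detected: {gap:.3f} s — expect a clock step shortly")
            baseline = suspended_seconds()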


Going deeper

What chronyc tracking is actually telling you

chronyc tracking outputs roughly a dozen fields; three of them carry most of the operational signal. Last offset is the most recent computed offset between local clock and the upstream reference, in seconds. RMS offset is the smoothed root-mean-square of recent offsets — a noise floor for how much the clock typically wanders. Frequency is the parts-per-million correction the daemon is currently applying to the kernel mult factor; a healthy host shows a small (<10 ppm) value, and a host whose quartz is drifting unusually shows tens of ppm. Update interval tells you how often the daemon polls upstream — defaults are 64 seconds and grow to 1024 seconds as confidence in the upstream improves. The number that should usually alarm you is Last offset exceeding 0.05 (50 ms) — that means either the upstream just stepped, the local clock just slewed off-target, or the network to upstream just got asymmetric. Wire this into your monitoring; most fleets do not, and the RailWala incident above is the typical consequence.

A healthy host on a Bengaluru data centre LAN talking to local pool servers typically shows output like:

Reference ID    : 4E5A0301 (time1.example.in)
Stratum         : 3
Last offset     : -0.000284 seconds   # 284 microseconds — healthy
RMS offset      : 0.000412 seconds    # ~400 us noise floor
Frequency       : 4.182 ppm slow      # quartz drifting; daemon correcting
Residual freq   : -0.013 ppm          # the daemon is well-tuned
Skew            : 0.187 ppm           # uncertainty in the frequency estimate
Root delay      : 0.003421 seconds    # network distance to Stratum 1
Root dispersion : 0.000867 seconds    # accumulated upstream uncertainty
Update interval : 256.0 seconds
Leap status     : Normal

Read the third line. Last offset of -284 microseconds is healthy for a host on a healthy LAN; sub-millisecond offsets indicate the daemon and the network are both behaving. If this number jumps to 0.060 (60 ms), an alarm should fire and the on-call should investigate before the protocol-level consequences arrive.
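
A minimal sketch of that alarm — parse chronyc tracking and fail when |Last offset| exceeds a budget; the 50 ms figure is the threshold used in the RailWala fix above, not a universal constant:

# chrony_offset_alarm.py — alarm when the locally reported NTP offset exceeds a budget
import re
import subprocess
import sys

OFFSET_BUDGET_S = 0.050   # assumed budget; set it well below whatever your protocol tolerates

def last_offset_seconds() -> float:
    out = subprocess.run(["chronyc", "tracking"], capture_output=True, text=True, check=True).stdout
    m = re.search(r"Last offset\s*:\s*([+-]?\d+\.\d+)\s*seconds", out)
    if not m:
        raise RuntimeError("could not find 'Last offset' in chronyc tracking output")
    return float(m.group(1))

if __name__ == "__main__":
    offset = last_offset_seconds()
    if abs(offset) > OFFSET_BUDGET_S:
        print(f"ALARM: clock offset {offset * 1e3:.1f} ms exceeds {OFFSET_BUDGET_S * 1e3:.0f} ms budget")
        sys.exit(1)
    print(f"ok: last offset {offset * 1e3:.3f} ms")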

PTP (Precision Time Protocol) — when NTP is not enough

PTP (IEEE 1588) is what you reach for when the protocol you are running cannot tolerate NTP's 5–20 ms accuracy. PTP typically runs over Layer 2 Ethernet (so it does not cross L3 boundaries without special handling) and uses hardware-timestamped switches to eliminate the asymmetric-network-delay problem that limits NTP. Sub-microsecond accuracy is achievable on a properly configured LAN with PTP-aware switches. The cost is the LAN itself: every switch in the path must be PTP-aware, every NIC must support hardware timestamps, and grandmaster clock election is a tuning exercise of its own. Spanner's TrueTime relies on per-data-centre GPS receivers and atomic clocks with tightly engineered time distribution inside each data centre. Major cloud providers now expose PTP-grade accuracy as a managed service on supported instance types — AWS's Time Sync Service, for example, advertises microsecond-level accuracy via a local PTP hardware clock — and if you need ±100 µs, this is the practical answer.

What the kernel actually does on a leap-second insertion

The leap-second handling code in kernel/time/timekeeping.c has been rewritten three times over the last 15 years; it has been the source of a steady stream of subtle bugs. The current behaviour (Linux 4.0+) on a leap-second insertion is: at 23:59:59 UTC the kernel's tai_offset (the difference between TAI — atomic time — and UTC) is incremented from 37 to 38, but the wall-clock value of CLOCK_REALTIME is held at 23:59:59 for one extra second instead of advancing to 00:00:00. From the application's perspective, two consecutive time.time() calls at the second boundary may return the same value (or even backwards values during the held second). Java's Thread.sleep and many synchronisation primitives assume monotonicity and have produced infinite loops or 100% CPU spins. The Google "leap smear" technique (announced 2011, used at Google since 2008) avoids this by redefining the second's length over the surrounding 24-hour window — every second is 11.6 microseconds longer than nominal — so the leap is absorbed into an undetectable slew rather than a step. AWS, Cloudflare, Meta, and Microsoft now follow leap-smear strategies; Google's NTP servers (time.google.com) publish smeared time; if your fleet uses pool.ntp.org, you may still be subject to step-mode leap seconds whenever IERS announces one.
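
The smear arithmetic is worth doing once by hand — a small sketch of the 24-hour linear smear described above:

# leap_smear_sketch.py — the arithmetic behind a 24-hour linear leap smear (illustrative)
LEAP_S = 1.0               # one inserted second to absorb
WINDOW_S = 24 * 3600       # smear window of 24 hours

stretch = LEAP_S / WINDOW_S            # ≈ 11.57 microseconds added to every smeared second
rate_ppm = stretch * 1e6               # ≈ 11.6 ppm — far below the ~500 ppm slew ceiling

print(f"each smeared second is {stretch * 1e6:.2f} us longer than nominal ({rate_ppm:.1f} ppm)")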

Why public NTP pools are not a single decision

When a Linux host configures pool 0.in.pool.ntp.org, it is not connecting to a server — it is connecting to a DNS rotation that returns a different IP every query, drawn from a community-volunteered pool of NTP servers in India. The pool's IP rotation means that consecutive chronyd queries can talk to entirely different upstream servers with different stratum levels, different reference clocks, and different network paths. A typical pool entry routes to a Stratum 2 server somewhere in Mumbai, Bangalore, or Chennai whose own upstream is a Stratum 1 server (often time.nist.gov or a national time authority). Round-trip times to pool servers in India range from 5 ms to 60 ms; the asymmetry distribution depends on the BGP routing of the day. The implication: even with a "correctly configured" pool entry, your host's wall-clock accuracy is bounded by the worst upstream the rotation hands you in any given hour, not the best. Operators running clock-sensitive infrastructure typically pin to a small set of explicitly-named upstream servers whose path latency they monitor — turning the pool's stochastic choice into a deterministic one.

Why this matters for distributed-systems protocol design: with a 60 ms upstream RTT, the offset error from path asymmetry alone can be as large as ~30 ms (half the RTT), and NTP has no way to measure that asymmetry — the half-RTT assumption is an article of faith, not a bound it can verify. Spanner's TrueTime quotes single-digit-millisecond uncertainty because Google operates dedicated GPS receivers and atomic clocks per data centre, with purpose-built time distribution. Your pool.ntp.org-configured host operates at 5–20 ms accuracy on a good LAN, 20–100 ms across the public internet, and worse during BGP route flaps. Any protocol that assumes "5 ms skew is enough" needs to either deploy PTP-grade infrastructure or budget more skew tolerance — there is no "configure NTP harder" option that closes the gap.

Reproduce this on your laptop

# Reproduce the wall-clock probe and observe a step correction
python3 -m venv .venv && source .venv/bin/activate
python3 probe_wall_clock.py &
sleep 5
sudo chronyc makestep            # force a step correction
wait

# Inspect your current clock-source choice and NTP state
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
chronyc tracking
chronyc sources -v

# See which TSC features your CPU has
grep -E "constant_tsc|nonstop_tsc|tsc_known_freq" /proc/cpuinfo | head -1

Where this leads next

This chapter named the wall clock as a stack of approximations on top of a quartz crystal that NTP nudges into rough agreement with UTC. The next chapters show what to do about it — each one a different way of bypassing or constraining the underlying lie: monotonic clocks, logical clocks, hybrid logical clocks, TrueTime-style bounded-uncertainty clocks, and PTP.

The pattern across all five: every solution either gives up on physical time (logical clocks), constrains it tightly with hardware (PTP, TrueTime), or restricts its use to intervals on a single machine (monotonic). What no solution does is make a vanilla NTP wall-clock trustworthy enough for cross-machine ordering. That is the limit named here.

What this means in practical terms for your service: the moment you find yourself writing code of the form if time.time() > deadline: or record.timestamp = time.time() and that record will be ordered against records written by other machines, you are at a decision point. Either bound the cross-machine skew you are willing to tolerate (and monitor chronyc tracking to alarm if reality exceeds the budget), or replace the wall-clock dependence with a logical or hybrid clock from one of the chapters above. The middle option — using wall-clock and hoping NTP keeps it close enough — is the path that produces the RailWala-style incident every 18 to 24 months in any non-trivial fleet. The chapters that follow are the principled alternatives.
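
A minimal sketch of the single-machine half of that decision — the same deadline expressed against the wall clock (fragile) and against the monotonic clock (step-proof); cross-machine ordering is the part neither variant solves:

# deadline_sketch.py — wall-clock deadlines vs monotonic deadlines (illustrative)
import time

def wait_until_wall(deadline_epoch_s: float) -> None:
    # Fragile: a backwards step stretches the wait; a forward step cuts it short.
    while time.time() < deadline_epoch_s:
        time.sleep(0.01)

def wait_for(duration_s: float) -> None:
    # Robust on one machine: CLOCK_MONOTONIC is unaffected by NTP slews and steps.
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(0.01)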

References

  1. Mills, "Network Time Protocol Version 4: Protocol and Algorithms Specification" — RFC 5905, the binding NTP specification including the slew-vs-step state machine.
  2. chrony — comparison with ntpd — the practical reference for the daemon most modern Linux distributions ship by default.
  3. Corbett et al., "Spanner: Google's Globally-Distributed Database" — OSDI 2012. The TrueTime architecture that defines what "bounded clock uncertainty" looks like as an API.
  4. Kulkarni et al., "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" — the HLC formalisation referenced as the practical synthesis when TrueTime is not available.
  5. IEEE 1588-2019 — PTP standard — the formal definition of Precision Time Protocol used in PTP-grade infrastructure.
  6. Google's leap-second smear announcement — the engineering case for spreading leap seconds over 24 hours rather than stepping at midnight.
  7. Wall: nothing works without time — internal cross-link to the Part 2 wall chapter that motivates this whole part of the curriculum.
  8. Designing Data-Intensive Applications, Chapter 8 — Kleppmann, O'Reilly 2017. The "Unreliable Clocks" section synthesises NTP, leap seconds, and protocol consequences for a practitioner audience.