Hybrid logical clocks (HLC)

Karan, an SRE at PaySetu, is debugging a payment-reconciliation job at 02:30 IST. The job replays writes from a five-region CockroachDB cluster and the trace shows a refund applied before the original charge — even though the charge's NTP timestamp is 11 ms older than the refund's. The cluster is healthy, NTP is within 4 ms across all regions, and the application code is provably correct. The bug is the timestamp: two writes one millisecond apart on different nodes can land in any order on a third node when wall-clocks are involved. Karan stares at the trace and wonders why the database doesn't just use Lamport timestamps. The answer is that he wants both: causal correctness and a timestamp that maps recognisably to wall-clock time, so the dashboard he opens in an hour can plot "writes at 02:30:14" rather than "writes at logical-counter 8723941". Hybrid logical clocks (HLC) are the construction Kulkarni, Demirbas, Madappa, Avva, and Leone published in 2014 that gives him both: a 64-bit timestamp that is causally correct like Lamport, drifts at most a bounded amount from true wall-clock time, and is monotonic per node even when the underlying clock jumps backward.

An HLC is a tuple (l, c) where l is a 48-bit physical-time-like component (milliseconds since epoch) and c is a 16-bit logical counter. On every event, l advances toward the local wall clock but never falls below the highest l it has seen; c increments only when l is held flat to disambiguate events within the same millisecond. The result is comparable as a single integer, captures the happens-before relation precisely, and stays within milliseconds of NTP — so dashboards still read 02:30:14 and the database still gives you a serialisable order. CockroachDB, MongoDB, YugabyteDB, and ScyllaDB all run HLCs in production.
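
The "comparable as a single integer" claim is easy to make concrete. A minimal sketch (this is the packing scheme the paper suggests, not any particular database's wire format): shift the 48-bit physical component left by 16 bits and OR in the counter, so plain integer comparison agrees with lexicographic (l, c) comparison.

```python
# Pack a 48-bit physical component and a 16-bit logical counter into one
# 64-bit integer; integer order then matches lexicographic (l, c) order.
L_BITS, C_BITS = 48, 16

def pack(l: int, c: int) -> int:
    assert l < (1 << L_BITS) and c < (1 << C_BITS)
    return (l << C_BITS) | c

def unpack(ts: int) -> tuple[int, int]:
    return ts >> C_BITS, ts & ((1 << C_BITS) - 1)

a = pack(1714003814231, 0)   # 02:30:14.231, counter 0
b = pack(1714003814231, 1)   # same millisecond, counter 1
c = pack(1714003814232, 0)   # next millisecond
assert a < b < c             # one integer comparison gives the HLC order
assert unpack(b) == (1714003814231, 1)
```

The 16-bit counter caps at 65535 events per held-flat millisecond; implementations that worry about overflow widen the counter or let l advance artificially.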

Why neither wall-clocks nor Lamport timestamps are enough

A wall-clock timestamp from gettimeofday() reads like 1714003814.231 — milliseconds since 1970, easy to plot on a dashboard, easy to compare with <. The trouble is that NTP-disciplined clocks lie under stress: when the local quartz drifts ahead by 8 ms, NTP corrects it by stepping the clock backward, and a sequence of events at t=100, 101, 102 from one process can suddenly read t=100, 95, 96 after a step. Even worse, leap seconds insert or delete a full second of time. A timestamp comparison that treats wall-clock time as monotonic is wrong by construction; the only thing it gives you for free is approximate human-readable time.
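
The step-back failure takes three lines to demonstrate. A sketch with invented timestamps: three causally ordered events stamped from a clock that NTP steps backward by 6 ms between events 1 and 2, then replayed in timestamp order.

```python
# Events in true causal order, stamped after an NTP step-back: the clock
# read 100 for the charge, then stepped back, so the refund reads 95.
events = [(100, "charge"), (95, "refund"), (96, "settle")]

replayed = sorted(events)            # a replay job orders by timestamp
assert [name for _, name in replayed] == ["refund", "settle", "charge"]
# the refund now sorts before the charge it causally depends on
```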

A Lamport timestamp sidesteps the monotonicity problem completely: it is a pure counter incremented on every local event and on every receive, and it preserves the strong clock condition (a → b ⇒ L(a) < L(b)). What it does not give you is any connection to physical time. If your Lamport counter is 8723941 you have no idea whether the event happened at 02:30 or 14:30 today, or even on the same day. For internal causality this is fine. For a database that has to expose timestamps to operators ("show me everything written between 02:30 and 02:35") and to clients ("read-as-of timestamp T"), pure Lamport is a non-starter.
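
For contrast with the HLC rules later in the chapter, here is the Lamport clock in full — a sketch, but a complete one, since the whole construction is a counter and a max:

```python
# Minimal Lamport clock: a bare counter preserving a -> b  =>  L(a) < L(b),
# carrying no physical-time information at all.
class LamportClock:
    def __init__(self):
        self.t = 0
    def local_event(self) -> int:
        self.t += 1
        return self.t
    def receive(self, t_msg: int) -> int:
        self.t = max(self.t, t_msg) + 1   # dominate both local and message
        return self.t

a, b = LamportClock(), LamportClock()
t1 = a.local_event()          # a emits 1
t2 = b.receive(t1)            # b emits 2, dominating the send
assert t1 < t2                # causality preserved; wall time unrecoverable
```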

Vector clocks preserve causality precisely (you can detect concurrency, not just impose a total order), but their size grows with N, the number of processes — a cost we explored in the dotted-version-vectors chapter. Even a 3-replica vector is 24+ bytes, carried on every key. Databases want 8-byte fixed-width timestamps so that every row, every WAL entry, every MVCC version carries one cheaply.

The HLC sits in the gap. It is not a replacement for TrueTime, which gives Spanner bounded uncertainty intervals using GPS and atomic clocks — a stronger guarantee at much higher cost. The HLC asks: given only commodity NTP-disciplined clocks, can we splice physical and logical time into a single 64-bit value that captures causality precisely and stays close to wall-clock time? Kulkarni et al. proved yes.

Figure: three timestamp regimes (wall clock, Lamport, HLC) under a clock-jump-backward event. Three events on one node, with NTP stepping the wall clock backward by 4 ms between events 2 and 3. Wall clock: e1 t=14.000, e2 t=14.005, e3 t=14.001 (e3 < e2 — wrong). Lamport: e1 L=1, e2 L=2, e3 L=3 (no wall time). HLC: e1 (14.000, 0), e2 (14.005, 0), e3 (14.005, 1) (monotonic and close to wall).
Illustrative — under an NTP step-back, wall-clock timestamps go non-monotonic and break ordering, Lamport stays monotonic but loses physical time, HLC stays both monotonic and close to wall-clock time by holding `l` flat at 14.005 and incrementing the logical counter `c` from 0 to 1.

Why the HLC pins l at 14.005 instead of letting it slide to 14.001 with the wall clock: the HLC update rule says l := max(l_prev, wall_now, l_msg). If l_prev = 14.005 and wall_now = 14.001, then l := 14.005 — the local clock is held flat, and the logical counter c increments to disambiguate. The physical component never goes backward. This is the key invariant: events on a single node have strictly increasing HLCs even when the underlying wall clock regresses. Without this, the database's MVCC layer would observe writes that "go back in time" within a single transaction, which corrupts snapshot reads.
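
The pin-and-increment behaviour can be traced directly. A sketch applying the local-event rule to the scenario above, with times written as milliseconds within second 02:30:14:

```python
# One step of the HLC local-event rule: l never falls below its previous
# value; c increments only when l is held flat.
def local_event(l, c, wall):
    l_new = max(l, wall)
    c_new = c + 1 if l_new == l else 0
    return l_new, c_new

state = (0, 0)
for wall in (14000, 14005, 14001):   # NTP steps the clock back before event 3
    state = local_event(*state, wall)
assert state == (14005, 1)           # l pinned at 14.005, counter absorbs the jump
```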

The HLC update rules — three lines that do the work

Kulkarni et al.'s 2014 paper specifies the HLC with two update procedures: one for local events (a write originating at this node) and one for receive events (a message arriving from another node carrying its sender's HLC). Both procedures touch a per-node state (l, c) — the last HLC this node emitted. The wall-clock function pt() returns the current physical time in milliseconds (NTP-disciplined, but assumed only loosely synchronised — within epsilon ms of true time).

Local event update

on local event:
    l_old := l
    l := max(l_old, pt())
    if l == l_old:
        c := c + 1
    else:
        c := 0
    emit (l, c)

If wall-clock time has advanced past the last l, take it — the HLC's physical component tracks wall time when wall time is monotonically advancing. If wall time has stalled or regressed (same millisecond, or NTP step-back), hold l at its previous value and increment c to keep the timestamp strictly increasing.

Receive event update

on receive (l_msg, c_msg):
    l_old := l
    l := max(l_old, l_msg, pt())
    if l == l_old and l == l_msg:
        c := max(c, c_msg) + 1
    elif l == l_old:
        c := c + 1
    elif l == l_msg:
        c := c_msg + 1
    else:
        c := 0
    emit (l, c)

The receive rule absorbs the sender's HLC: the new l is the max of three sources (local, message, wall), and c is set so that the resulting (l, c) strictly dominates both (l_old, c_old) and (l_msg, c_msg) — the strong clock condition. The four cases handle the four ways the new l can equal an existing source.

The arithmetic looks fiddly but the invariant is clean: HLC is the smallest pair (l, c) such that (l, c) > (l_old, c_old) and (l, c) > (l_msg, c_msg) and l ≥ pt(). The minimality matters because the bigger l - pt() grows, the further the HLC drifts from wall time, and the database operator's intuition ("a row written at 02:30:14 should have HLC physical-component near 14000") starts to break.

Why the receive rule uses max(l_old, l_msg, pt()) rather than just max(l_old, l_msg): if the network round-trip stalled the message for 200 ms but both nodes' wall clocks advanced normally, the local pt() will already be past l_msg, and using max with all three sources ensures the new HLC reflects wall time rather than catching up only to the (now-stale) sender's l_msg. Skipping pt() in the max would let HLC drift behind wall time on every receive — fine for causality, bad for "the dashboard says 02:30:14".
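
The minimality invariant from the previous section is checkable by brute force. A sketch: for all small inputs, the receive rule's output equals the lexicographically least pair that strictly dominates both inputs while keeping l ≥ pt().

```python
import itertools

# The receive rule, transcribed from the pseudocode above.
def receive(l_old, c_old, l_msg, c_msg, pt):
    l = max(l_old, l_msg, pt)
    if l == l_old and l == l_msg: c = max(c_old, c_msg) + 1
    elif l == l_old:              c = c_old + 1
    elif l == l_msg:              c = c_msg + 1
    else:                         c = 0
    return l, c

for l_old, c_old, l_msg, c_msg, pt in itertools.product(range(3), repeat=5):
    got = receive(l_old, c_old, l_msg, c_msg, pt)
    # Smallest candidate satisfying all three constraints of the invariant.
    want = min((l, c) for l in range(5) for c in range(10)
               if (l, c) > (l_old, c_old) and (l, c) > (l_msg, c_msg) and l >= pt)
    assert got == want, (l_old, c_old, l_msg, c_msg, pt, got, want)
```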

A 60-line HLC simulator with three nodes and a clock skew

The cleanest way to internalise HLC is to run three nodes with skewed wall clocks, send messages between them, and watch the HLC components track both wall time and message causality.

# hlc_sim.py — three-node HLC with skewed wall clocks and message exchange
import time
from dataclasses import dataclass

@dataclass
class HLC:
    l: int = 0  # physical-component, ms since epoch
    c: int = 0  # logical counter

    def __lt__(self, other):
        return (self.l, self.c) < (other.l, other.c)

    def __repr__(self):
        return f"({self.l - 1714003814000}+{self.c})"  # show offset for readability

class Node:
    def __init__(self, name, skew_ms=0):
        self.name = name
        self.skew = skew_ms
        self.hlc = HLC()

    def pt(self):
        # Wall clock as integer milliseconds, plus a per-node skew
        return int(time.time() * 1000) + self.skew

    def local_event(self, label):
        l_old = self.hlc.l
        l_new = max(l_old, self.pt())
        c_new = self.hlc.c + 1 if l_new == l_old else 0
        self.hlc = HLC(l_new, c_new)
        print(f"  {self.name} local  {label:>10}  hlc={self.hlc}  wall={self.pt() - 1714003814000}")
        return self.hlc

    def receive(self, msg_hlc, label):
        l_old, c_old = self.hlc.l, self.hlc.c
        l_msg, c_msg = msg_hlc.l, msg_hlc.c
        l_new = max(l_old, l_msg, self.pt())
        if l_new == l_old == l_msg:    c_new = max(c_old, c_msg) + 1
        elif l_new == l_old:           c_new = c_old + 1
        elif l_new == l_msg:           c_new = c_msg + 1
        else:                           c_new = 0
        self.hlc = HLC(l_new, c_new)
        print(f"  {self.name} recv   {label:>10}  hlc={self.hlc}  msg={msg_hlc}")
        return self.hlc

if __name__ == "__main__":
    # Three nodes; node B's wall clock runs 8 ms ahead of A; C's runs 3 ms behind
    A = Node("A", skew_ms=0)
    B = Node("B", skew_ms=8)
    C = Node("C", skew_ms=-3)

    print("--- A writes, sends to B ---")
    h1 = A.local_event("write x=1")
    B.receive(h1, "from A")
    print("--- B writes locally, sends to C ---")
    h2 = B.local_event("write y=2")
    C.receive(h2, "from B")
    print("--- C tries to write — its wall clock is behind, watch HLC pin to msg ---")
    C.local_event("write z=3")
    print("--- A receives an old message from B with stale HLC ---")
    A.receive(h1, "from B (stale)")  # A already past h1
    print(f"\nFinal HLCs: A={A.hlc}  B={B.hlc}  C={C.hlc}")
    assert A.hlc < C.hlc or C.hlc < A.hlc or A.hlc == C.hlc

Sample run:

--- A writes, sends to B ---
  A local      write x=1  hlc=(412+0)  wall=412
  B recv          from A  hlc=(420+0)  msg=(412+0)
--- B writes locally, sends to C ---
  B local      write y=2  hlc=(421+0)  wall=421
  C recv          from B  hlc=(421+1)  msg=(421+0)
--- C tries to write — its wall clock is behind, watch HLC pin to msg ---
  C local      write z=3  hlc=(421+2)  wall=410
--- A receives an old message from B with stale HLC ---
  A recv     from B (stale)  hlc=(413+0)  msg=(412+0)

Final HLCs: A=(413+0)  B=(421+0)  C=(421+2)

The output tells the whole story. A writes at wall time 412 ms (relative to a base epoch), produces HLC (412, 0). B's wall clock is 8 ms ahead, so when B receives A's HLC it adopts the local wall (420) — the physical component tracks wall time. B then locally writes at wall 421, produces (421, 0) and sends to C. C's wall clock is 3 ms behind, but the receive rule pins l to 421 (max of msg, local, wall) and increments c to 1 — physical component is ahead of C's wall clock, but only by the same margin as B's skew, bounded. When C then writes locally, its wall clock is still 410 (behind), so the HLC stays at l=421 and c increments to 2. The HLC has cleanly tracked causality (A → B → C writes are strictly ordered) without letting C's behind-by-3 wall clock invent an out-of-order timestamp.

Why the assertion A.hlc < C.hlc or C.hlc < A.hlc or A.hlc == C.hlc is trivially true: HLC is a total order — every two HLCs are comparable as (l, c) lexicographic pairs. This is stronger than what causality requires; vector clocks give a partial order that flags concurrency. HLC trades the concurrency-detection ability for cheap total-order comparability and 8-byte fixed width. Databases want this trade because their MVCC layer needs some total order on writes, and "concurrent writes" still need to be ordered for the storage engine — they just don't need to be ordered "correctly" in any application-meaningful sense, which is fine because the application uses higher-level conflict resolution.

Why the offset l - pt() is bounded by the maximum clock skew: at any point, l was advanced either to pt() of the current node, or to l_msg of a sender. The sender's l_msg was itself bounded by its own pt() plus the largest skew on its network neighbourhood. By induction, after a finite number of hops, l cannot exceed max(pt() across the cluster) + epsilon, where epsilon is the largest pairwise skew. This bound is what lets database operators reason "an HLC of 14005 corresponds to wall time within ±epsilon of 14005" — the clock has not run away into logical-counter land.
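
The induction argument can be checked empirically. A sketch: random gossip among five nodes with fixed wall-clock skews; after every event, no node's l exceeds the fastest wall clock in the cluster, so l - pt() stays within the skew spread.

```python
import random

# Per-node HLC state with a fixed skew against a shared "true" clock `now`.
class Node:
    def __init__(self, skew): self.skew, self.l, self.c = skew, 0, 0
    def pt(self, now): return now + self.skew
    def send(self, now):
        l_new = max(self.l, self.pt(now))
        self.c = self.c + 1 if l_new == self.l else 0
        self.l = l_new
        return self.l, self.c
    def recv(self, now, l_msg, c_msg):
        l_new = max(self.l, l_msg, self.pt(now))
        if l_new == self.l == l_msg: self.c = max(self.c, c_msg) + 1
        elif l_new == self.l:        self.c += 1
        elif l_new == l_msg:         self.c = c_msg + 1
        else:                        self.c = 0
        self.l = l_new

random.seed(1)
nodes = [Node(random.randint(-50, 50)) for _ in range(5)]
max_skew = max(n.skew for n in nodes)
for now in range(1000, 2000):
    a, b = random.sample(nodes, 2)
    b.recv(now, *a.send(now))
    for n in nodes:
        # Bounded drift: l can never exceed now + max_skew, so the offset
        # from this node's own wall clock is at most the skew spread.
        assert n.l - n.pt(now) <= max_skew - n.skew
```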

A production story — CockroachDB's max_offset and the HLC catch-up

CockroachDB ships HLC as the timestamp authority across its multi-region clusters. The configuration knob --max-offset defaults to 500 ms — the assumed maximum NTP skew between any two nodes. CockroachDB enforces this aggressively: if a node detects that its wall clock is more than 80% of max-offset away from a peer's HLC, it panics and exits. The reasoning is that HLC's bounded-drift guarantee depends on the clock-skew assumption holding; if the assumption is violated, the database can no longer guarantee serialisable transactions, and crashing is preferable to silently returning wrong answers.

In April 2025, the on-call SRE Aditi at PaySetu (running CockroachDB across ap-south-1, ap-southeast-1, and eu-west-1) saw three nodes panic-exit within 90 seconds of each other. The error in the logs was clock synchronization error: detected clock offset of 612 ms from peer node. NTP on the affected hosts had been silently failing for three days — the chrony daemon's upstream had become unreachable, and the local clock had drifted by 612 ms over 72 hours. CockroachDB's HLC layer detected this on the next inter-node heartbeat exchange (HLCs from peers were arriving with l 600 ms ahead of local wall clock), refused to continue, and crashed.

Aditi's recovery sequence: re-point the affected nodes' chrony at a working stratum-2 server, wait for the local clock to slew back into compliance (slewing out 612 ms takes several seconds at chrony's default maximum slew rate; a single hard step can be forced with chronyc makestep), then restart the cockroach processes. Service was restored in 8 minutes. The key lesson, written into the post-mortem: HLC is robust to small skews and crashes loudly on large ones — by design. The alternative (silently absorbing a 612 ms skew) would have corrupted the database's snapshot-isolation reads.

Figure: CockroachDB max_offset panic timeline at PaySetu. NTP fails silently 72 hours before the incident; clock drift accumulates linearly toward the 400 ms crash threshold (80% of the 500 ms max_offset); three nodes cross it and panic within 90 seconds of each other; recovery via chronyc makestep at minute 8.
Illustrative — clock drift accumulating silently over three days while NTP was broken, the moment three nodes crossed CockroachDB's 80%-of-max_offset threshold within 90 seconds, and the operator's `chronyc makestep` recovery. The crash-loud-on-skew property is what protected snapshot-isolation reads from corruption.

The deeper architectural lesson: HLC is not a substitute for clock synchronisation, it is a consumer of clock synchronisation. CockroachDB pairs HLC with strict offset detection because the timestamp guarantees only hold under the skew assumption. MongoDB's HLC implementation makes a different trade: it bounds how far a received cluster time may run ahead of the local wall clock (the maxAcceptableLogicalClockDriftSecs server parameter) and rejects cluster times beyond that bound rather than absorbing them, which caps drift at the cost of refusing operations whose timestamps arrive too far ahead. Both designs treat HLC as a correctness primitive that depends on bounded skew, and both have a crash/reject posture for when the bound is violated.

Common confusions

Going deeper

The Kulkarni et al. 2014 paper and the bounded-drift theorem

Kulkarni, Demirbas, Madappa, Avva, and Leone proved formally in their 2014 paper "Logical Physical Clocks" (OPODIS) that HLC satisfies four properties simultaneously: (1) strong clock condition (a → b ⇒ HLC(a) < HLC(b)), (2) bounded drift (|l - pt| is bounded by the maximum clock skew across the system), (3) monotonicity per node, and (4) constant fixed-width representation. The proof works by showing that the HLC update rules form a well-founded recursion: every event's HLC is a function of the local state, the wall clock, and at most one received HLC, and the max operation ensures the new HLC dominates all three. The bounded-drift result is the key engineering claim — it means HLC is operationally usable for time-range queries ("show me writes between 02:30 and 02:35") whose results will be at most epsilon ms wrong. The paper also shows that HLC degenerates gracefully: if all clocks are perfectly synchronised, HLC behaves like a wall-clock timestamp with a tiebreaker; if clocks are arbitrarily skewed, HLC degenerates toward Lamport semantics (with l mostly held flat and c doing the work).
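
The graceful-degradation claim at the end is easy to see by running the local-event rule under the two extreme clock regimes — a sketch:

```python
# The HLC local-event rule, as in the paper.
def local_event(l, c, pt):
    l_new = max(l, pt)
    return (l_new, c + 1) if l_new == l else (l_new, 0)

# Frozen wall clock: pt() stuck at 500 -> Lamport-like, c does all the work.
state = (500, 0)
for _ in range(3):
    state = local_event(*state, 500)
assert state == (500, 3)

# Healthy advancing clock: l tracks wall time, c stays at 0 throughout.
state = (0, 0)
for pt in (600, 601, 602):
    state = local_event(*state, pt)
assert state == (602, 0)
```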

MongoDB's clusterTime vs CockroachDB's HLC — different design points

MongoDB introduced HLC as clusterTime in version 3.6 (2017), exposed to clients as a session token (afterClusterTime in read concerns). MongoDB's design bounds how far clusterTime's l may drift ahead of pt(): if a cluster time arrives with l - pt() beyond the configured maxAcceptableLogicalClockDriftSecs, MongoDB rejects it rather than absorbing it. This trades robustness-to-skew for a hard cap on drift. CockroachDB's choice (default max_offset=500ms, panic on detection) is the opposite: absorb the HLC, but crash if skew is too large. Both are defensible; the MongoDB choice is better for clients that send HLC tokens around long-running sessions (the token won't surprise a node hours later); the CockroachDB choice is better for tight transactional workloads where any silent skew is unacceptable.

Why HLC is the right primitive for snapshot isolation

CockroachDB's MVCC layer assigns every write an HLC and exposes "read at HLC t" semantics: the read returns the most recent committed value with HLC ≤ t. Because HLC is monotonic per node and bounded-drift across the cluster, "read at HLC equal to current wall time" approximates "read the recent committed state" to within epsilon ms. This is what makes HLC pair so naturally with snapshot isolation: the snapshot is identified by an HLC, and the snapshot is consistent because HLC respects causality. Compare this with wall-clock-only MVCC (which can produce non-repeatable reads under NTP step-backs) and pure-Lamport MVCC (which has no readable physical-time component). HLC is the smallest construction that supports both database operators ("show me the snapshot at 02:30") and database internals ("don't let a write's HLC go backward").
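
The "read at HLC t" semantics reduce to a sorted lookup. A minimal sketch (a hypothetical in-memory version list, not CockroachDB's storage layout): versions are kept sorted by HLC, and a read returns the newest committed value with HLC ≤ t.

```python
import bisect

# (hlc, value) pairs for one key, sorted by hlc.
versions = [
    ((14000, 0), "v1"),
    ((14005, 0), "v2"),
    ((14005, 1), "v3"),
]

def read_at(versions, t):
    # Newest version with HLC <= t, or None if the key did not exist yet.
    i = bisect.bisect_right([h for h, _ in versions], t)
    return versions[i - 1][1] if i else None

assert read_at(versions, (14005, 0)) == "v2"   # exactly at a version's HLC
assert read_at(versions, (14004, 7)) == "v1"   # between versions
assert read_at(versions, (13999, 0)) is None   # before any write
```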

The <wall_time, logical_counter, node_id> extension — fully ordered HLCs

Some implementations (notably YugabyteDB) extend HLC with a third component, the originating node id: (l, c, node_id). This gives a strict total order even when two nodes happen to emit the exact same (l, c) pair (rare but possible under extreme load). The node id breaks the tie deterministically, which simplifies certain consensus protocols (every Paxos round needs unique proposal numbers; an HLC plus node id makes "use HLC as proposal number" trivially correct without further deduplication). The cost is 8 more bytes per timestamp; for systems where every row carries an MVCC version, this matters, and most production HLCs do without the node id, falling back to first-mover-wins on the rare collision.
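
The tiebreak is just a third comparison key — a sketch with invented values:

```python
# Two nodes emit the identical (l, c); the node id makes the order total
# and deterministic.
ts_a = (14005, 2, 1)     # (l, c, node_id) from node 1
ts_b = (14005, 2, 2)     # identical (l, c) from node 2
assert ts_a != ts_b and ts_a < ts_b
assert sorted([ts_b, ts_a]) == [ts_a, ts_b]
```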

Reproduce this on your laptop

# Reproduce the HLC simulation
python3 hlc_sim.py

# Stress: simulate 1000 events with random skew, verify HLC stays monotonic per node
python3 -c "
from hlc_sim import Node
import random, time
nodes = [Node(f'n{i}', skew_ms=random.randint(-50, 50)) for i in range(5)]
last = {n.name: None for n in nodes}
for _ in range(1000):
    n = random.choice(nodes)
    h = n.local_event('w')
    if last[n.name] is not None:
        assert last[n.name] < h, f'{n.name} HLC went backward'
    last[n.name] = h
    if random.random() < 0.3:
        target = random.choice([m for m in nodes if m is not n])
        target.receive(h, 'gossip')
print('HLC monotonicity held across 1000 events with random skew up to 50ms')
"

Where this leads next

HLC is the joining point between Part 3's clock theory and the production-grade timestamp authority used by Part 8's consensus systems and Part 12's consistency models.

The unifying lesson: HLC is what you reach for when you want most of TrueTime's properties without TrueTime's hardware budget. You give up bounded uncertainty (epsilon-wait commit) and you gain commodity-NTP deployability. For most production databases that is the right trade. For the rare workload that genuinely needs external consistency at global scale, TrueTime (or a Paxos-elected timestamp service like Spanner's, or a hybrid like CockroachDB's linearizable=true mode that adds wait time) is the answer.

References

  1. Kulkarni, Demirbas, Madappa, Avva, Leone, "Logical Physical Clocks" — OPODIS 2014. The canonical HLC paper. Defines the construction, proves the four properties, and bounds the drift.
  2. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" — CACM 1978. The foundation HLC's logical-component descends from.
  3. CockroachDB HLC and clock-offset documentation — production reference for max_offset, panic behaviour, and the integration with MVCC.
  4. MongoDB Hybrid Logical Clock and $clusterTime — MongoDB's HLC implementation, exposed via session tokens.
  5. YugabyteDB clock-skew handling — how a different production HLC handles the (l, c, node_id) extension and clock-skew detection.
  6. Corbett et al., "Spanner: Google's Globally-Distributed Database" — OSDI 2012. The TrueTime alternative; useful contrast for understanding what HLC trades away.
  7. Logical clocks (Lamport) — internal cross-link to the chapter HLC builds on.