Logical clocks (Lamport)

At 11:42 IST on a Friday, Riya at PaySetu opens an incident channel. The audit log for a single payment shows two events — wallet_debited at 11:39:14.221 and merchant_credited at 11:39:14.198 — recorded by two different services on two different hosts. The credit appears 23 milliseconds before the debit. Both timestamps came from time.time() on healthy NTP-synced hosts; both hosts had clock offsets under 5 ms; everything dashboards-green. The auditor is on the call asking the same question Riya is asking — how did the merchant get paid before the wallet was charged? The answer is that the wall clocks are not lying about the time; they are lying about the order, and there is no NTP setting that fixes this. What does fix it is a 1978 paper that throws the wall clock away.

A Lamport clock is a single integer counter on each process that increments on every local event and travels along every message. The receiver sets its counter to max(local, received) + 1. This gives every event a number, and the numbers respect the happened-before relation: if event A could have caused B, then A's number is smaller. The clock cannot tell you whether two events with different numbers are concurrent or causal — that is the price you pay for using one integer. Vector clocks fix that, at the cost of N integers per event. Lamport timestamps are the floor every other logical-clock construction is built on.

The happened-before relation — what physical time was hiding

Leslie Lamport's 1978 paper opens with an observation that, once you internalise it, makes wall-clock-based ordering feel naive. In a distributed system, the only thing two processes can do that connects them is send a message from one to the other. Everything else they do — local computation, reading from disk, updating in-memory state — happens in isolation. Two processes that never exchange a message have no way to disagree about anything: their event histories are independent. So if you want to define "event A happened before event B" without trusting any clock, you can do it using only the message structure.

Lamport defines the relation → (read "happened-before") on events. Two events a and b satisfy a → b if and only if one of three things is true: (1) a and b happen on the same process, and a comes before b in that process's local execution order; (2) a is the send of a message and b is the receipt of the same message; (3) there exists some event c such that a → c and c → b (transitivity). That is the entire definition. There is no clock involved. There is no global time. There is only "events on the same process" and "send-receive pairs", composed by transitivity.

Two events a and b for which neither a → b nor b → a holds are called concurrent, written a ∥ b. Concurrency is not "events that happened at exactly the same wall-clock instant" — it is the weaker, far more useful notion of "events whose order cannot affect the system's behaviour, because no information could have flowed from one to the other in time to change the other". Concurrent events can be ordered in any way without contradicting any observation any process could have made. This is why concurrency is the structural definition that distributed protocols actually need — not "simultaneity".
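
The definition is mechanical enough to check by machine. Here is a minimal sketch (illustrative, not from Lamport's paper): model each event as a (process, local index) pair, list the send-receive pairs, and compute happened-before as reachability over same-process successor edges and message edges; concurrency is then simply the absence of a path in either direction.

# happened_before.py — the three-case definition as graph reachability (sketch)
def happened_before(a, b, events, messages):
    """True iff a -> b: same-process order, send->receive, or a chain of both."""
    if a == b:
        return False
    frontier, seen = [a], set()
    while frontier:
        e = frontier.pop()
        if e == b:
            return True
        if e in seen:
            continue
        seen.add(e)
        nxt = (e[0], e[1] + 1)          # case 1: next event on the same process
        if nxt in events:
            frontier.append(nxt)
        if e in messages:               # case 2: the receive of a message sent at e
            frontier.append(messages[e])
        # case 3 (transitivity) is the graph search itself
    return False

def concurrent(a, b, events, messages):
    return not happened_before(a, b, events, messages) and \
           not happened_before(b, a, events, messages)

events = {("P1", 1), ("P1", 2), ("P2", 1), ("P2", 2), ("P3", 1)}
messages = {("P1", 2): ("P2", 2)}       # the send at P1#2 is received at P2#2

print(happened_before(("P1", 1), ("P2", 2), events, messages))   # True, via the message
print(concurrent(("P1", 1), ("P3", 1), events, messages))        # True, no chain exists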

The happened-before relation on three processes: timelines P1, P2, P3 with events a through g connected by messages m1, m2, m3. Annotated pairs: a → d (via m1); b ∥ c (no chain connects them); and f ∥ g, even though m2 lands at f and m3 lands at g, because f and g are concurrent siblings of e.
Illustrative — events `a` and `d` are ordered (`a → d` via message m1). Events `b` and `c` are concurrent (`b ∥ c`) — no chain of intra-process steps and messages connects them. Concurrency is the absence of a causal chain, not the equality of any clock.

Why concurrency-as-no-causal-chain matters more than simultaneity: in PaySetu's payment audit, the question "did the credit happen before the debit" is really "could the credit have causally depended on the debit". If no message ever flowed from the debit-recording event to the credit-recording event, the two events are concurrent — the system's state is the same under either ordering, and the auditor's question is malformed. Wall clocks compare moments; happened-before compares possibilities of influence. Distributed protocols only need the second one.

The Lamport clock — one integer, one rule

Lamport's logical clock construction assigns a single non-negative integer C(e) to every event e such that a → b ⟹ C(a) < C(b). The reverse — C(a) < C(b) ⟹ a → b — does not hold, and the failure of that converse is the signature limitation of Lamport clocks. The construction is a 3-line algorithm.

Each process P_i maintains a local counter C_i, initially 0. The rules are:

  1. Local event rule — before each local event on P_i, increment C_i := C_i + 1. Stamp the event with the new value.
  2. Send rule — when P_i sends a message m, increment C_i := C_i + 1 and attach C_i to the message as a timestamp t(m).
  3. Receive rule — when P_i receives a message m with timestamp t(m), set C_i := max(C_i, t(m)) + 1. Stamp the receive event with the new value.

That's it. Three lines. Every event everywhere in the distributed system gets a single integer, and the integers respect happened-before.

The proof that a → b ⟹ C(a) < C(b) is a one-paragraph induction over the three cases of the happened-before definition. (1) Same-process events: by the local event rule, the counter strictly increases between any two events on the same process, so if a precedes b on P_i then C(a) < C(b). (2) Send-receive pair: if a is the send of message m and b is the receive, then by the send rule C(a) = t(m), and by the receive rule C(b) = max(C_i, t(m)) + 1 ≥ t(m) + 1 > C(a), where C_i is the receiver's prior counter. (3) Transitivity: if a → c → b, then by induction C(a) < C(c) < C(b). The three cases cover the full definition of →. The numbering is consistent with causality by construction.

Lamport clock evolution on three processes with two messages: P1 has a=1 and b=2 (send m1, t=2); P2 has c=1, d=3 (recv m1, max(1,2)+1), and e=4 (send m2, t=4); P3 has x=1 and g=5 (recv m2, max(1,4)+1). Event f has C=3 and g has C=5, yet f and g are concurrent; the numbers do not prove causation either way.
Illustrative — Lamport timestamps after applying the three rules. The orange-bordered events are those that incremented under the receive rule (`max + 1`). The grey events are unrelated to the cross-process messages. Notice that `f` (C=3) and `g` (C=5) are concurrent — yet `g`'s number is larger, even though no causal chain connects them.

Why the receive rule uses max(C_i, t(m)) + 1 and not just t(m) + 1: a process that has been doing local work has a counter that may already be much larger than any timestamp it receives — for instance if P2 did 1000 local events before receiving message m from P1 whose timestamp is t(m) = 5, the receiver's counter is already 1000 and using t(m) + 1 = 6 would decrease it, violating the local monotonicity rule. The max is what guarantees the counter is monotone on every process, and +1 is what guarantees the receive event is strictly greater than the send event.
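
A few lines make the failure mode concrete (the numbers mirror the example above):

# Receive rule: why max(C_i, t(m)) + 1 and not t(m) + 1.
local_counter = 1000       # the receiver has already done 1000 local events
t_m = 5                    # timestamp carried by the incoming message

wrong = t_m + 1                          # 6: the counter would move backwards
right = max(local_counter, t_m) + 1      # 1001: monotone locally, still greater than t(m)

assert wrong < local_counter             # monotonicity on the receiver is violated
assert right > local_counter and right > t_m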

A small Lamport-clock simulator with three processes and a deliberate ordering test

The cleanest way to internalise the construction is to run it. The Python below is a small heapq-driven discrete-event simulation: three processes, an asynchronous message bus with random delays, a workload that mixes local events and message sends. Every event is logged with its Lamport timestamp. The end of the run prints the global event order sorted by Lamport timestamp, ties broken by (timestamp, process_id). The script also runs a sanity check: for every delivered message, assert that the send timestamp it carried is strictly less than the receive event's timestamp. If the assertion ever fires, the implementation is broken.

# lamport_sim.py — three-process Lamport clock with deliberate concurrency
import random, heapq, itertools

class Process:
    def __init__(self, pid: int):
        self.pid, self.lts, self.log = pid, 0, []
    def tick(self) -> int:
        # Local-event / send rule: increment, then stamp with the new value.
        self.lts += 1
        return self.lts
    def absorb(self, sender_lts: int) -> int:
        # Receive rule: jump past both the local counter and the message timestamp.
        self.lts = max(self.lts, sender_lts) + 1
        return self.lts

def simulate(n_proc=3, n_steps=30, seed=42):
    random.seed(seed)
    procs = [Process(i) for i in range(n_proc)]
    queue, ctr, t = [], itertools.count(), 0.0      # ctr breaks heap ties
    for _ in range(n_steps):
        t += random.expovariate(2.0)
        pid = random.randrange(n_proc)
        if random.random() < 0.4 and n_proc > 1:    # send
            dst = (pid + 1 + random.randrange(n_proc - 1)) % n_proc
            lts = procs[pid].tick()
            procs[pid].log.append((t, pid, "send", lts, dst))
            heapq.heappush(queue, (t + random.uniform(0.05, 0.3),
                                   next(ctr), dst, lts, pid))
        else:                                       # local event
            lts = procs[pid].tick()
            procs[pid].log.append((t, pid, "local", lts, None))
        while queue and queue[0][0] <= t:           # deliver messages that are due
            arr_t, _, dst, sender_lts, src = heapq.heappop(queue)
            lts = procs[dst].absorb(sender_lts)
            procs[dst].log.append((arr_t, dst, "recv", lts, src, sender_lts))
    return procs

if __name__ == "__main__":
    procs = simulate()
    all_evts = [e for p in procs for e in p.log]
    all_evts.sort(key=lambda e: (e[3], e[1]))      # sort by (lts, pid)
    print(f"{'lts':>4} {'pid':>3} {'kind':<6} {'wall_t':>7}")
    for wall, pid, kind, lts, *_ in all_evts[:18]:
        print(f"{lts:>4} {pid:>3} {kind:<6} {wall:>7.2f}")
    # Sanity: for every delivered message, the timestamp it carried (the send's
    # Lamport time) must be strictly smaller than the receive event's timestamp.
    for p in procs:
        for e in p.log:
            if e[2] == "recv":
                assert e[5] < e[3], f"causality violated: {e}"
    print("\nOK: send_lts < recv_lts holds for all messages")

A sample run looks like this (illustrative output; your exact interleaving may differ):

 lts pid kind    wall_t
   1   0 local     0.36
   1   1 local     0.74
   1   2 local     2.34
   2   0 send      1.12
   2   1 local     1.48
   2   2 local     2.67
   3   0 local     1.81
   3   1 recv      1.92
   3   2 send      3.42
   4   1 send      2.13
   4   2 local     3.78
   5   0 recv      2.41
   5   1 local     2.89
   5   2 local     4.41
   6   0 send      3.14
   7   0 recv      4.02
   7   1 recv      3.61
   8   1 local     4.23

OK: send_lts < recv_lts holds for all messages

Read the output as the global order Lamport gives you — a total order consistent with causality. The first column is the Lamport timestamp; the second is the process id; the third is the event kind. Ties on lts are broken by pid — that is the standard tie-break for Lamport clocks and the part that turns the partial order into a total order. The lts column increases monotonically down the listing, but notice that the wall-clock times (the rightmost column) jump back and forth: P2's lts=1 event happened at wall-time 2.34, after P0's lts=3 event at 1.81, and P2's lts=3 send at 3.42 sorts ahead of P1's lts=4 send at 2.13. The Lamport order does not match wall-clock order — and it does not promise to. Lamport time orders information flow, not wall time.

Why ties are broken by pid and not at random: the goal of Lamport clocks is a total order that is consistent with the happened-before partial order. Two events at the same lts are necessarily concurrent (no causal chain connects them; if one did, the clock rules would have forced their timestamps apart). So you can pick any total-order extension. The convention (lts, pid) is deterministic — every process, given the same set of events, computes the same total order. This is the property state-machine replication and ordered multicast both depend on.

Why the simulation uses random.expovariate(2.0) for inter-event time and random.uniform(0.05, 0.3) for message delay: real distributed systems have bursty, roughly Poisson inter-event times and bounded-but-variable message delay. The exponential distribution is the inter-arrival law of a Poisson process. The uniform delay range captures "messages arrive somewhere between 50 ms and 300 ms after send" — realistic for a same-region cross-AZ RPC. Flattening both into random.uniform(0, 1) would give you a less interesting causal structure, because the balance between local activity and message delay that makes the receive rule's max visibly pull processes forward would be lost.

A production incident — RailWala's Tatkal-hour booking-order audit

RailWala runs the booking gateway for the morning Tatkal window — the 30-minute period at 10:00 IST when the highest-demand reserved-class tickets become available, with load peaking at 1.4M booking attempts per second across 14 backend services. Every booking has a session of seven RPC hops: authentication → seat-availability → fare-calculation → payment-init → seat-hold → payment-confirm → ticket-issue. Each hop logs an entry into a global audit ledger with (session_id, hop, host_wall_clock_us, payload_hash). The ledger is partitioned by session_id and feeds a downstream reconciliation job that, every 5 minutes, computes "did every booking go through all hops in order?".

In April 2024 the reconciliation job started flagging 0.3% of bookings as "out-of-order" — seat-hold timestamps appearing earlier than the seat-availability timestamps that supposedly caused them. The first hypothesis was clock skew: NTP across the 1400-host fleet was synced to within ±2 ms, but a few hosts had drifted to 8–14 ms during peak load when CPU pressure starved chronyd. That hypothesis was wrong. Even hosts with sub-millisecond NTP showed the inversion. The second hypothesis was clock step from leap-second handling. Wrong — no leap second had been scheduled anywhere near that window. The third hypothesis was a coding bug: maybe one of the services was logging time.time() before doing the work. Wrong: every service logged the time after acquiring the relevant database row.

The actual cause was none of the above. The audit ledger was being written from 14 services, each on a different host with a different NTP-sync history, and "did this hop happen before that hop" was being decided by comparing wall-clock microseconds across hosts. With NTP at ±2 ms baseline plus occasional ±10 ms drift under load, two events 0.3 ms apart could be recorded with wall-clocks in the wrong order — even when the causal order was unambiguous from the message flow. The reconciliation job was asking a wall-clock question that wall-clocks can never answer.

The fix was to add a Lamport timestamp to every audit entry, propagated through the RPC chain. The seat-availability service's response carried its current Lamport timestamp; the seat-hold service applied the receive rule to its own counter and stamped its log entry with the result. The reconciliation job switched from "sort by wall-clock" to "sort by (lamport_ts, service_id)". The 0.3% inversion rate dropped to zero on the next deploy, and the team re-ran the previous month's audit on the historical wall-clock-sorted data with a Lamport reconstruction (extracting the implicit happened-before relation from the request-trace IDs) and confirmed the inversions had been reporting artefacts, not actual booking-order bugs. The wall-clocks had been fine. The system had been fine. The audit had been wrong about the question it was asking.
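
For concreteness, the propagation pattern looks roughly like the sketch below. The header name, service names, and ledger shape are hypothetical stand-ins, not RailWala's actual code; the point is that every hop applies the receive rule to the incoming stamp and the send rule to the outgoing one.

# Sketch of Lamport-timestamp propagation through an RPC chain (hypothetical names).
class Service:
    def __init__(self, name, ledger, downstream=None):
        self.name, self.lts = name, 0
        self.ledger, self.downstream = ledger, downstream

    def _advance(self, incoming=0):
        # Receive rule when incoming > 0; plain local/send rule when incoming == 0.
        self.lts = max(self.lts, incoming) + 1
        return self.lts

    def handle(self, hop, headers):
        ts = self._advance(headers.get("x-lamport-ts", 0))    # fold the caller's stamp in
        self.ledger.append((ts, self.name, hop))               # audit entry
        if self.downstream:                                     # stamp the outgoing RPC
            self.downstream.handle(hop + 1, {"x-lamport-ts": self._advance()})

ledger = []
seat_hold = Service("seat-hold", ledger)
seat_availability = Service("seat-availability", ledger, downstream=seat_hold)
seat_availability.handle(hop=1, headers={})

# Reconciliation sorts by (lamport_ts, service), never by wall-clock time.
for entry in sorted(ledger):
    print(entry)     # the availability entry always sorts before the hold it caused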

The deeper lesson — and the reason this sits in the curriculum — is that wall-clocks are not the order primitive of distributed systems. They are an approximation that holds when events are far enough apart in time that NTP skew can't reorder them, and the threshold for "far enough" is not a fixed number — it grows with cluster size and load. Lamport timestamps, by contrast, give you the order primitive for free: one integer per message, no NTP, no skew, and the order they encode is exactly the order that matters for deciding what could have caused what.

Common confusions

  1. Concurrent does not mean simultaneous. Two events are concurrent when no causal chain connects them; they can be seconds apart on the wall clock and still be concurrent.
  2. C(a) < C(b) does not mean a caused b. The implication runs only from happened-before to the timestamps, never backwards; a smaller number is compatible with both "causally before" and "merely concurrent".
  3. Lamport clocks do not need NTP. The construction uses no physical time at all; the counter rides on the messages, so skew and drift are irrelevant to the order it produces.
  4. A Lamport timestamp does not replace the wall clock. It orders information flow; if you need to know when something happened, you still need a physical clock next to it.

Going deeper

State-machine replication via Lamport timestamps — the original 1978 algorithm

Lamport's 1978 paper does not stop at defining the clock. The paper goes on to use Lamport timestamps to build total-order broadcast — a protocol where every process delivers the same sequence of events in the same order, which is the foundation of state-machine replication. The construction: every process maintains a queue of pending requests, ordered by (lamport_ts, process_id). To deliver a request, a process must have an ack from every other process for that request, and must have heard a message with a higher (lts, pid) from every other process than the request being delivered. The acknowledgements ensure no future request can have a smaller timestamp, which is exactly the condition needed to deliver safely. The protocol assumes no failures and FIFO message channels — both assumptions that real systems must lift. Real consensus protocols (Paxos, Raft, Zab) replace these assumptions with quorums and term numbers, but the underlying ordering primitive is still Lamport-style: every committed entry has a unique increasing number, and that number is the ordering authority.
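
A compact sketch of that delivery condition, under the paper's assumptions (no failures, FIFO channels); the class and method names here are illustrative, not Lamport's notation:

# Total-order delivery in the style of Lamport 1978 (sketch): deliver the smallest-stamped
# pending request once every other process has been heard from past that timestamp.
class TotalOrderQueue:
    def __init__(self, my_pid, n_procs):
        self.my_pid = my_pid
        self.pending = []                                   # (lts, origin_pid, payload)
        self.latest_seen = {p: 0 for p in range(n_procs)}   # highest lts heard per process

    def on_message(self, lts, sender):
        # Any message (request or ack) from `sender` promises, via FIFO channels,
        # that it will never again issue a request with a smaller timestamp.
        self.latest_seen[sender] = max(self.latest_seen[sender], lts)

    def on_request(self, lts, origin, payload):
        self.pending.append((lts, origin, payload))
        self.on_message(lts, origin)

    def deliverable(self):
        """Pop requests in (lts, pid) order once no smaller-stamped request can still arrive."""
        self.pending.sort()
        out = []
        while self.pending:
            lts, origin, _ = self.pending[0]
            others_past = all(seen > lts for p, seen in self.latest_seen.items()
                              if p != self.my_pid)
            if others_past:
                out.append(self.pending.pop(0))
            else:
                break
        return out

q = TotalOrderQueue(my_pid=0, n_procs=3)
q.on_request(lts=4, origin=1, payload="debit")
q.on_message(lts=5, sender=1)          # a later message from P1 (e.g. its next ack)
q.on_message(lts=6, sender=2)          # a later message from P2
print(q.deliverable())                  # [(4, 1, 'debit')]: now safe to deliver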

What zxid in ZooKeeper actually is

ZooKeeper's zxid (ZooKeeper transaction id) is a 64-bit integer attached to every committed write. The high 32 bits are the epoch (incremented on every leader election); the low 32 bits are the counter (incremented on every committed transaction within the epoch). This is recognisably a Lamport-style construction with added structure: every transaction has a unique increasing number that respects causality. The epoch acts as a coarse-grain Lamport clock advanced on leader change; the counter acts as a fine-grain Lamport clock advanced per transaction. The zxid gives clients the total order of writes. Read-your-writes consistency is implemented as "send your last-seen zxid to the next read; the server blocks until it has caught up to that zxid". This is the Lamport-clock-as-causality-token pattern, productionised. CockroachDB's MVCCTimestamp, Spanner's commit_timestamp, and Kafka's (epoch, offset) pair are all variations on the same theme.
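
The epoch/counter split is easy to mimic; the helper below is an illustration of the documented 64-bit layout, not ZooKeeper's own code:

# zxid layout: high 32 bits = leader epoch, low 32 bits = per-epoch transaction counter.
def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid: int) -> tuple:
    return zxid >> 32, zxid & 0xFFFFFFFF

z_old = make_zxid(epoch=7, counter=41)
z_new = make_zxid(epoch=8, counter=1)    # new leader: epoch bumps, counter resets
assert z_new > z_old                     # plain integer comparison respects the epoch
print(hex(z_old), split_zxid(z_old))     # 0x700000029 (7, 41)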

The cost of Lamport clocks in a production RPC chain

A Lamport timestamp adds 8 bytes to every RPC payload (a 64-bit counter; 32 bits is too few for a long-running cluster — 4B events at 100k events/sec is only 12 hours). On a 200-byte RPC, that is 4% overhead — negligible. On a 50-byte heartbeat, that is 16% — still negligible. The latency cost is one integer compare and one increment on every send and receive — sub-nanosecond on modern hardware. The CPU cost of Lamport clocks is, in practice, unmeasurable. The interesting cost is operational — every RPC tracing layer must propagate the timestamp, every audit log must record it, every service-mesh sidecar must understand it as a special header. A Lamport timestamp that is propagated by 13 of 14 services and dropped by 1 is worse than no Lamport timestamp at all, because the chain breaks invisibly at the gap.

Why Lamport timestamps cannot detect concurrency — the formal argument

The Lamport implication a → b ⟹ C(a) < C(b) is one-way. The reverse — C(a) < C(b) ⟹ a → b — fails because the construction collapses N processes' counters into a single integer, and information about which process generated the timestamp is lost. To detect concurrency (a ∥ b) you need to establish that neither a → b nor b → a holds, and a single integer can only tell you which timestamp is larger — a larger timestamp is compatible with both "causally after" and "merely concurrent". Vector clocks fix this by assigning a vector V_i ∈ ℕ^N to each event on process P_i, where V_i[j] is the latest event from process j that P_i has heard about. Two events a on P_i and b on P_j are concurrent iff V(a)[i] > V(b)[i] AND V(b)[j] > V(a)[j] — neither vector dominates. The cost is O(N) per timestamp. The reduction in cost from O(N) (vector clocks) to O(1) (Lamport) is exactly the reduction in expressiveness from "concurrency detected" to "concurrency invisible".
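
A minimal version of the domination test (a sketch, using plain Python lists as vectors):

# Vector-clock comparison: concurrency is "neither vector dominates the other".
def dominates(u, v):
    """True iff u[k] >= v[k] for every process k."""
    return all(a >= b for a, b in zip(u, v))

def concurrent(u, v):
    return not dominates(u, v) and not dominates(v, u)

a = [2, 0, 0]    # an event on P0: it has seen two P0 events, nothing from P1 or P2
b = [0, 1, 0]    # an event on P1 that has heard nothing from P0 or P2
c = [2, 2, 0]    # a later P1 event, after receiving a message that carried a's vector

print(concurrent(a, b))   # True: neither dominates, so a || b
print(concurrent(a, c))   # False: c dominates a, so a -> c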

Reproduce this on your laptop

# Reproduce the Lamport simulation
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 lamport_sim.py

# Try a more skewed workload — see what happens with mostly-local events
python3 -c "
import lamport_sim
procs = lamport_sim.simulate(n_proc=5, n_steps=200, seed=1)
all_evts = sorted([e for p in procs for e in p.log], key=lambda e: (e[3], e[1]))
print(f'total events: {len(all_evts)}, final lts per process: {[p.lts for p in procs]}')
"

# Vary the message ratio in lamport_sim.py — change 0.4 to 0.1 and watch
# Lamport timestamps grow much more slowly (no max-receive jumps).

Where this leads next

This chapter named the Lamport clock as the simplest possible logical-time construction — one integer, three rules, a total order consistent with causality. Two follow-on directions extend it: vector clocks, which make concurrency detectable at the cost of one counter per process, and hybrid logical clocks, which re-attach a physical-time component so the timestamps are also human-readable.

The unifying pattern: Lamport timestamps are the partial-order primitive every causality-respecting distributed protocol descends from. Vector clocks add concurrency detection. HLCs add a wall-clock prefix for human readability. TrueTime adds a hardware-bounded uncertainty interval. Each layer adds capability and cost. Lamport's three-line construction is the floor — the minimum you need to talk about order in a distributed system without lying about it.

References

  1. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" — CACM 1978. The foundational paper. Read it; it's only 8 pages and the argument is elegant.
  2. Mattern, "Virtual Time and Global States of Distributed Systems" — Parallel and Distributed Algorithms 1989. Generalises Lamport to vector clocks; introduces the concurrency-detection criterion.
  3. Kulkarni et al., "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" — 2014 tech report introducing HLC. The synthesis between Lamport and physical time.
  4. Apache ZooKeeper internals — zxid — production example of a Lamport-style construction with epoch + counter structure.
  5. Kleppmann, Designing Data-Intensive Applications, Chapter 9 — O'Reilly 2017. The "Ordering Guarantees" section's treatment of Lamport timestamps as the ordering primitive.
  6. Wall clocks and NTP — internal cross-link to the chapter that explains why the wall clock is not the right answer for ordering.
  7. Schwarz & Mattern, "Detecting causal relationships in distributed computations: In search of the holy grail" — Distributed Computing 1994. The classic survey of the causality-detection problem and the Lamport / vector-clock spectrum.