Logical clocks (Lamport)
At 11:42 IST on a Friday, Riya at PaySetu opens an incident channel. The audit log for a single payment shows two events — wallet_debited at 11:39:14.221 and merchant_credited at 11:39:14.198 — recorded by two different services on two different hosts. The credit appears 23 milliseconds before the debit. Both timestamps came from time.time() on healthy NTP-synced hosts; both hosts had clock offsets under 5 ms; everything dashboards-green. The auditor is on the call asking the same question Riya is asking — how did the merchant get paid before the wallet was charged? The answer is that the wall clocks are not lying about the time; they are lying about the order, and there is no NTP setting that fixes this. What does fix it is a 1978 paper that throws the wall clock away.
A Lamport clock is a single integer counter on each process that increments on every local event and travels along every message. The receiver sets its counter to max(local, received) + 1. This gives every event a number, and the numbers respect the happened-before relation: if event A could have caused B, then A's number is smaller. The clock cannot tell you whether two events with different numbers are concurrent or causal — that is the price you pay for using one integer. Vector clocks fix that, at the cost of N integers per event. Lamport timestamps are the floor every other logical-clock construction is built on.
The happened-before relation — what physical time was hiding
Leslie Lamport's 1978 paper opens with an observation that, once you internalise it, makes wall-clock-based ordering feel naive. In a distributed system, the only thing two processes can do that connects them is send a message from one to the other. Everything else they do — local computation, reading from disk, updating in-memory state — happens in isolation. Two processes that never exchange a message have no way to disagree about anything: their event histories are independent. So if you want to define "event A happened before event B" without trusting any clock, you can do it using only the message structure.
Lamport defines the happened-before relation → (read "happened-before") on events. Two events a and b satisfy a → b if and only if one of three things is true: (1) a and b happen on the same process, and a comes before b in that process's local execution order; (2) a is the send of a message and b is the receipt of the same message; (3) there exists some event c such that a → c and c → b (transitivity). That is the entire definition. There is no clock involved. There is no global time. There is only "events on the same process" and "send-receive pairs", composed by transitivity.
Two events a and b for which neither a → b nor b → a holds are called concurrent, written a ∥ b. Concurrency is not "events that happened at exactly the same wall-clock instant" — it is the weaker, far more useful notion of "events whose order cannot affect the system's behaviour, because no information could have flowed from one to the other in time to change the other". Concurrent events can be ordered in any way without contradicting any observation any process could have made. This is why concurrency is the structural definition that distributed protocols actually need — not "simultaneity".
Why concurrency-as-no-causal-chain matters more than simultaneity: in PaySetu's payment audit, the question "did the credit happen before the debit" is really "could the credit have causally depended on the debit". If no message ever flowed from the debit-recording event to the credit-recording event, the two events are concurrent — the system's state is the same under either ordering, and the auditor's question is malformed. Wall clocks compare moments; happened-before compares possibilities of influence. Distributed protocols only need the second one.
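The happened-before relation can be computed mechanically from nothing but the message structure. The sketch below (event names and helper are illustrative, not from the chapter) builds the transitive closure of same-process edges and send-receive edges, then checks order and concurrency:

```python
# Illustrative sketch: deciding happened-before / concurrency purely from
# the message structure, with no clocks. Events are opaque names; edges are
# same-process succession plus send -> receive pairs.
def happened_before(edges):
    """Return the transitive closure of the edge set: (a, b) in hb iff a -> b."""
    hb = set(edges)
    changed = True
    while changed:  # naive closure; fine for small event sets
        changed = False
        for a, c in list(hb):
            for c2, b in list(hb):
                if c == c2 and (a, b) not in hb:
                    hb.add((a, b))
                    changed = True
    return hb

# P0: a1 then a2 (local order); P1: b1 then b2; a message is sent at a2
# and received at b2.
edges = [("a1", "a2"), ("b1", "b2"), ("a2", "b2")]
hb = happened_before(edges)
assert ("a1", "b2") in hb          # a1 -> a2 -> b2 by transitivity
assert ("a1", "b1") not in hb      # no chain in either direction...
assert ("b1", "a1") not in hb      # ...so a1 and b1 are concurrent: a1 || b1
```

Note that a1 and b1 end up concurrent even though one of them certainly occurred first in wall time — the relation only records possible influence.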
The Lamport clock — one integer, one rule
Lamport's logical clock construction assigns a single non-negative integer C(e) to every event e such that a → b ⟹ C(a) < C(b). The reverse — C(a) < C(b) ⟹ a → b — does not hold, and the failure of that converse is the signature limitation of Lamport clocks. The construction is a 3-line algorithm.
Each process P_i maintains a local counter C_i, initially 0. The rules are:
- Local event rule — before each local event on P_i, increment C_i := C_i + 1. Stamp the event with the new value.
- Send rule — when P_i sends a message m, increment C_i := C_i + 1 and attach C_i to the message as a timestamp t(m).
- Receive rule — when P_i receives a message m with timestamp t(m), set C_i := max(C_i, t(m)) + 1. Stamp the receive event with the new value.
That's it. Three lines. Every event everywhere in the distributed system gets a single integer, and the integers respect happened-before.
The proof that a → b ⟹ C(a) < C(b) is a one-paragraph induction over the three cases of the happened-before definition. (1) Same-process events: by the local event rule, the counter strictly increases between any two events on the same process, so if a precedes b on P_i then C(a) < C(b). (2) Send-receive pair: if a is the send of message m and b is the receive, then by the send rule C(a) = t(m), and by the receive rule C(b) = max(C_i^{prior}, t(m)) + 1 ≥ t(m) + 1 > C(a). (3) Transitivity: if a → c → b, then by induction C(a) < C(c) < C(b). The three cases cover the full definition of →. The numbering is consistent with causality by construction.
Why the receive rule uses max(C_i, t(m)) + 1 and not just t(m) + 1: a process that has been doing local work has a counter that may already be much larger than any timestamp it receives — for instance if P2 did 1000 local events before receiving message m from P1 whose timestamp is t(m) = 5, the receiver's counter is already 1000 and using t(m) + 1 = 6 would decrease it, violating the local monotonicity rule. The max is what guarantees the counter is monotone on every process, and +1 is what guarantees the receive event is strictly greater than the send event.
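The counter-decrease failure is easy to see in two lines of code. A minimal sketch, contrasting the correct receive rule with the naive one (function names are illustrative):

```python
# The receive rule, correct vs naive. The naive version can move the
# receiver's counter backwards, violating local monotonicity.
def receive_correct(c_local, t_m):
    return max(c_local, t_m) + 1

def receive_naive(c_local, t_m):
    return t_m + 1  # WRONG: ignores the receiver's own history

c_local = 1000   # receiver has already done 1000 local events
t_m = 5          # incoming message carries a small timestamp
assert receive_correct(c_local, t_m) == 1001  # monotone: 1000 -> 1001
assert receive_naive(c_local, t_m) == 6       # counter jumps back: 1000 -> 6
```

The max keeps the counter monotone on every process; the +1 keeps the receive event strictly after the send.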
A compact Lamport-clock simulator with three processes and a deliberate ordering test
The cleanest way to internalise the construction is to run it. The Python below is a small discrete-event simulation: three processes, an asynchronous message bus with random delays, a workload that mixes local events and message sends. Every event is logged with its Lamport timestamp. The end of the run prints the global event order sorted by Lamport timestamp, ties broken by (timestamp, process_id). The simulation also runs a sanity check: for every delivered message, assert that the send timestamp is strictly less than the receive timestamp. If the assertion ever fires, the implementation is broken.
# lamport_sim.py — three-process Lamport clock with deliberate concurrency
import heapq
import itertools
import random

class Process:
    def __init__(self, pid: int):
        self.pid, self.lts, self.log = pid, 0, []

    def tick(self) -> int:
        """Local-event / send rule: increment, then stamp."""
        self.lts += 1
        return self.lts

    def absorb(self, sender_lts: int) -> int:
        """Receive rule: C_i := max(C_i, t(m)) + 1."""
        self.lts = max(self.lts, sender_lts) + 1
        return self.lts

def simulate(n_proc=3, n_steps=30, seed=42):
    random.seed(seed)
    procs = [Process(i) for i in range(n_proc)]
    queue, ctr, t = [], itertools.count(), 0.0  # ctr breaks heap ties
    for _ in range(n_steps):
        t += random.expovariate(2.0)
        pid = random.randrange(n_proc)
        if random.random() < 0.4 and n_proc > 1:  # send to a random peer
            dst = (pid + 1 + random.randrange(n_proc - 1)) % n_proc
            lts = procs[pid].tick()
            procs[pid].log.append((t, pid, "send", lts, dst))
            heapq.heappush(queue, (t + random.uniform(0.05, 0.3),
                                   next(ctr), dst, lts, pid))
        else:  # local event
            lts = procs[pid].tick()
            procs[pid].log.append((t, pid, "local", lts, None))
        # deliver every message whose arrival time has passed; messages
        # still in flight when the run ends are simply dropped
        while queue and queue[0][0] <= t:
            arr_t, _, dst, sender_lts, src = heapq.heappop(queue)
            lts = procs[dst].absorb(sender_lts)
            # recv entries carry (src, sender_lts) so the sanity check
            # below can compare send and receive timestamps directly
            procs[dst].log.append((arr_t, dst, "recv", lts, (src, sender_lts)))
    return procs

if __name__ == "__main__":
    procs = simulate()
    all_evts = [e for p in procs for e in p.log]
    all_evts.sort(key=lambda e: (e[3], e[1]))  # sort by (lts, pid)
    print(f"{'lts':>4} {'pid':>3} {'kind':<6} {'wall_t':>7}")
    for wall, pid, kind, lts, _ in all_evts[:18]:
        print(f"{lts:>4} {pid:>3} {kind:<6} {wall:>7.2f}")
    # Sanity: send_lts < recv_lts for every delivered message
    for p in procs:
        for _, _, kind, lts, meta in p.log:
            if kind == "recv":
                src, sender_lts = meta
                assert sender_lts < lts, (src, sender_lts, lts)
    print("\nOK: send_lts < recv_lts holds for all messages")
Sample run:
lts pid kind wall_t
1 0 local 0.36
1 1 local 0.74
2 0 send 1.12
2 1 local 1.48
3 0 local 1.81
3 1 recv 1.92
4 1 send 2.13
4 2 local 2.34
5 0 recv 2.41
5 2 recv 2.67
6 1 local 2.89
7 0 send 3.14
7 2 send 3.42
8 1 recv 3.61
8 2 local 3.78
9 0 local 4.02
10 1 recv 4.23
10 2 recv 4.41
OK: send_lts < recv_lts holds for all messages
Read the output as the global causal order Lamport gives you. The first column is the Lamport timestamp; the second is the process id; the third is the event kind. Ties on lts are broken by pid — that is the standard tie-break for Lamport clocks and the part that turns the partial order into a total order. The lts column increases monotonically down the listing, but notice that wall-clock times (the rightmost column) jump back and forth: event lts=3 on P1 happened at wall-time 1.92, while lts=3 on P0 happened at 1.81. The Lamport order does not match wall-clock order at the granularity of a single tick — and this is exactly what the construction promises. Lamport time orders information flow, not wall time.
Why ties are broken by pid and not at random: the goal of Lamport clocks is a total order that is consistent with the happened-before partial order. Two events at the same lts are necessarily concurrent (no causal chain connects them, otherwise the receive rule would have separated them). So you can pick any total-order extension. The convention (lts, pid) is deterministic — every process, given the same set of events, computes the same total order. This is the property State Machine Replication and ordered multicast both depend on.
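The determinism claim is a one-liner to verify: give two replicas the same event set in different arrival orders and check they compute identical total orders. A minimal sketch with made-up events:

```python
import random

# Events as (lts, pid, kind) tuples; the exact events are illustrative.
events = [(3, 1, "recv"), (2, 0, "send"), (3, 0, "local"), (1, 1, "local")]

def total_order(evts):
    # key = (lamport_ts, pid): a deterministic total-order extension of
    # the happened-before partial order
    return sorted(evts, key=lambda e: (e[0], e[1]))

# Two replicas receive the same events in different orders...
a, b = events[:], events[:]
random.shuffle(a)
random.shuffle(b)
# ...yet agree on the final order, event for event.
assert total_order(a) == total_order(b)
assert total_order(a)[0] == (1, 1, "local")
```

Swap the deterministic key for a random tie-break and the two replicas diverge — which is exactly the failure mode ordered multicast cannot tolerate.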
Why the simulation uses random.expovariate(2.0) for inter-event time and random.uniform(0.05, 0.3) for message delay: real distributed systems have bursty, Poisson-like event arrivals and bounded-but-variable message delay. The exponential distribution captures the Poisson approximation for inter-event gaps. The uniform delay range captures "messages arrive somewhere between 50 ms and 300 ms after send" — realistic for a same-region cross-AZ RPC. Replacing these with random.uniform(0, 1) would give you a less interesting causal structure, because most messages would arrive before the sender's next local event, masking the effect of the receive rule.
A production incident — RailWala's Tatkal-hour booking-order audit
RailWala runs the booking gateway for the morning Tatkal window — the 30-minute period at 10:00 IST when the highest-demand reserved-class tickets become available, with concurrency peaking at 1.4M booking attempts per second across 14 backend services. Every booking is a session of seven RPC hops: authentication → seat-availability → fare-calculation → payment-init → seat-hold → payment-confirm → ticket-issue. Each hop logs an entry into a global audit ledger with (session_id, hop, host_wall_clock_us, payload_hash). The ledger is partitioned by session_id and feeds a downstream reconciliation job that, every 5 minutes, computes "did every booking go through all hops in order?".
In April 2024 the reconciliation job started flagging 0.3% of bookings as "out-of-order" — seat-hold timestamps appearing earlier than the seat-availability timestamps that supposedly caused them. The first hypothesis was clock skew: NTP across the 1400-host fleet was synced to within ±2 ms, but a few hosts had drifted to 8–14 ms during peak load when CPU pressure starved chronyd. That hypothesis was wrong. Even hosts with sub-millisecond NTP showed the inversion. The second hypothesis was clock step from leap-second handling. Wrong — no leap seconds were active in April. The third hypothesis was a coding bug: maybe one of the services was logging time.time() before doing the work. Wrong: every service logged the time after acquiring the relevant database row.
The actual cause was none of the above. The audit ledger was being written from 14 services, each on a different host with a different NTP-sync history, and "did this hop happen before that hop" was being decided by comparing wall-clock microseconds across hosts. With NTP at ±2 ms baseline plus occasional ±10 ms drift under load, two events 0.3 ms apart could be recorded with wall-clocks in the wrong order — even when the causal order was unambiguous from the message flow. The reconciliation job was asking a wall-clock question that wall-clocks can never answer.
The fix was to add a Lamport timestamp to every audit entry, propagated through the RPC chain. The seat-availability service's response carried its current Lamport timestamp; the seat-hold service applied the receive rule to its own counter and stamped its log entry with the result. The reconciliation job switched from "sort by wall-clock" to "sort by (lamport_ts, service_id)". The 0.3% inversion rate dropped to zero on the next deploy, and the team re-ran the previous month's audit on the historical wall-clock-sorted data with a Lamport reconstruction (extracting the implicit happened-before relation from the request-trace IDs) and confirmed the inversions had been reporting artefacts, not actual booking-order bugs. The wall-clocks had been fine. The system had been fine. The audit had been wrong about the question it was asking.
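The shape of that fix fits in a few lines. A hedged sketch of the pattern — the class and service names here are hypothetical, not RailWala's actual code: each service keeps one counter, applies the receive rule on every inbound RPC, and stamps its audit entry with the result.

```python
# Sketch of the RailWala-style fix (names hypothetical): one Lamport
# counter per service, propagated through the RPC chain.
class AuditClock:
    def __init__(self):
        self.lts = 0

    def on_send(self):
        self.lts += 1          # send rule: increment, attach to the RPC
        return self.lts

    def on_receive(self, upstream_lts):
        # receive rule: max(local, received) + 1, then stamp the log entry
        self.lts = max(self.lts, upstream_lts) + 1
        return self.lts

availability, hold = AuditClock(), AuditClock()
hold.lts = 40                      # seat-hold has been busy; counter is ahead
t_avail = availability.on_send()   # availability responds, attaching its lts
t_hold = hold.on_receive(t_avail)  # seat-hold absorbs it before logging
# Reconciliation sorts by (lamport_ts, service_id), never by wall clock:
ledger = sorted([(t_hold, "seat-hold"), (t_avail, "seat-availability")])
assert ledger[0][1] == "seat-availability"  # causal order survives
```

No matter how skewed the two hosts' wall clocks are, the seat-hold entry cannot sort before the seat-availability entry that caused it.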
The deeper lesson — and the reason this sits in the curriculum — is that wall-clocks are not the order primitive of distributed systems. They are an approximation that holds when events are far enough apart in time that NTP skew can't reorder them, and the threshold for "far enough" is not a fixed number — it grows with cluster size and load. Lamport timestamps, by contrast, give you the order primitive for free: one integer per message, no NTP, no skew, and the order they encode is exactly the order that matters for deciding what could have caused what.
Common confusions
- "Lamport timestamps tell you the wall-clock order of events." They tell you a total order consistent with the happened-before partial order — nothing about wall time. Two events on the same Lamport timestamp from different processes can be hours apart in wall time. Two events with very different Lamport timestamps can be microseconds apart in wall time. Lamport time is causality, not chronology.
- "If C(a) < C(b), then a → b." No — this is the converse, and it does not hold. Many concurrent events have different Lamport timestamps; the construction assigns them numbers, but the numbers do not prove causation. To detect concurrency, you need vector clocks. Lamport gives you the implication in one direction only: a → b ⟹ C(a) < C(b).
- "Lamport clocks need synchronised wall clocks." They do not need wall clocks at all. A Lamport counter is a pure integer that increments on local events and on message receipts — no clock_gettime, no NTP, no quartz. You can run a Lamport-clocked protocol on a process whose wall clock is wildly wrong, and the causality ordering still holds.
- "The receive rule should set C_i := t(m) + 1." That violates local monotonicity if the receiver's counter is already larger. The max is what makes the rule safe. A common rookie implementation drops the max and breaks under high local-event rates relative to message arrival.
- "Lamport clocks give a total order, so they solve consensus." They give a total order on events that is consistent with happened-before — but they do not solve agreement on values. Total-order broadcast (which Lamport timestamps enable, in his original paper, via state-machine replication) requires every process to deliver the same set of events, which fails under crash failures unless you layer Paxos or Raft on top. Lamport timestamps are the ordering primitive; consensus is the agreement primitive; the two are complementary, not interchangeable.
- "Lamport timestamps are obsolete, replaced by vector clocks." Vector clocks are strictly more informative (they detect concurrency) but also strictly more expensive (O(N) per message). Many production systems — Kafka offsets, etcd's term-and-index, ZooKeeper's zxid — are essentially Lamport timestamps with extra structure. The single integer is still the right primitive when you only need a total order, not concurrency detection.
Going deeper
State-machine replication via Lamport timestamps — the original 1978 algorithm
Lamport's 1978 paper does not stop at defining the clock. The paper goes on to use Lamport timestamps to build total-order broadcast — a protocol where every process delivers the same sequence of events in the same order, which is the foundation of state-machine replication. The construction: every process maintains a queue of pending requests, ordered by (lamport_ts, process_id). To deliver a request, a process must have an ack from every other process for that request, and every other process must have a request in their queue with a higher (lts, pid) than the request being delivered. The acknowledgements ensure no future request can have a smaller timestamp, which is exactly the condition needed to safely deliver. The protocol assumes no failures and FIFO message channels — both assumptions that real systems must lift. Real consensus protocols (Paxos, Raft, Zab) replace these assumptions with quorums and term-numbers, but the underlying ordering primitive is still Lamport-style: every committed entry has a unique increasing number, and that number is the ordering authority.
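The delivery condition can be sketched in a few lines. This is a compact, hypothetical rendering of the check (helper names are mine, and it assumes the paper's failure-free, FIFO setting): a process may deliver the smallest pending (lts, pid) once every other process has been heard from with a strictly larger timestamp.

```python
# Sketch of Lamport's 1978 delivery condition (assumes no failures and
# FIFO channels, as the paper does). Requests are (lts, pid) pairs.
def can_deliver(queue, last_seen, n_proc):
    """queue: pending (lts, pid) requests;
    last_seen[j]: largest lts heard from process j.
    Returns the deliverable head request, or None."""
    if not queue:
        return None
    lts, pid = min(queue)       # smallest (lts, pid) pending
    for j in range(n_proc):
        if j == pid:
            continue
        if last_seen[j] <= lts: # j might still send something smaller
            return None
    return (lts, pid)

queue = [(3, 0), (5, 1)]
# P1 has only been heard at lts=3: it could still send a request <= 3.
assert can_deliver(queue, last_seen=[0, 3, 4], n_proc=3) is None
# Once P1 is heard at lts=4, nothing smaller than (3, 0) can arrive.
assert can_deliver(queue, last_seen=[0, 4, 4], n_proc=3) == (3, 0)
```

The FIFO assumption is load-bearing: without it, hearing lts=4 from P1 would not rule out a delayed lts=2 still in flight.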
What zxid in ZooKeeper actually is
ZooKeeper's zxid (ZooKeeper transaction id) is a 64-bit integer attached to every committed write. The high 32 bits are the epoch (incremented on every leader election); the low 32 bits are the counter (incremented on every committed transaction within the epoch). This is recognisably a Lamport-style construction with a structure: every transaction has a unique increasing number that respects causality. The epoch acts as a coarse-grain Lamport clock advanced on leader change; the counter acts as a fine-grain Lamport clock advanced per transaction. Reads of the zxid give clients the total order of writes. Read-your-writes consistency is implemented as "send your last-seen zxid to the next read; the server blocks until it has caught up to that zxid". This is the Lamport-clock-as-causality-token pattern, productionised. CockroachDB's MVCCTimestamp, Spanner's commit_timestamp, and Kafka's (epoch, offset) pair are all variations on the same theme.
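The epoch-plus-counter packing is a two-function exercise. An illustrative sketch (this is the bit layout described above, not ZooKeeper's actual source):

```python
# Illustrative zxid-style packing: epoch in the high 32 bits, per-epoch
# counter in the low 32 bits. Plain integer comparison then respects
# (epoch, counter) lexicographic order.
def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

z1 = make_zxid(epoch=7, counter=1042)
z2 = make_zxid(epoch=8, counter=1)   # new leader: epoch bumps, counter resets
assert split_zxid(z1) == (7, 1042)
assert z2 > z1   # a single integer compare orders across leader changes
```

This is why "send your last-seen zxid" works as a causality token: catching up to an integer is the same as catching up to a point in the total order.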
The cost of Lamport clocks in a production RPC chain
A Lamport timestamp adds 8 bytes to every RPC payload (a 64-bit counter; 32 bits is too few for a long-running cluster — 4B events at 100k events/sec is only 12 hours). On a 200-byte RPC, that is 4% overhead — negligible. On a 50-byte heartbeat, that is 16% — still negligible. The latency cost is one integer compare and one increment on every send and receive — sub-nanosecond on modern hardware. The CPU cost of Lamport clocks is, in practice, unmeasurable. The interesting cost is operational — every RPC tracing layer must propagate the timestamp, every audit log must record it, every service-mesh sidecar must understand it as a special header. A Lamport timestamp that is propagated by 13 of 14 services and dropped by 1 is worse than no Lamport timestamp at all, because the chain breaks invisibly at the gap.
Why Lamport timestamps cannot detect concurrency — the formal argument
The Lamport implication a → b ⟹ C(a) < C(b) is one-way. The reverse — C(a) < C(b) ⟹ a → b — fails because the construction collapses N processes' counters into a single integer, and information about which process generated the timestamp is lost. Detecting concurrency (a ∥ b) requires a comparison that can come out "incomparable", but two integers always compare as less, equal, or greater — one counter has no way to express "neither event could have known about the other". Vector clocks fix this by assigning a vector V ∈ ℕ^N to each event, where V[j] records the latest event from process j that the stamping process has heard about. Two events a on P_i and b on P_j are concurrent iff V(a)[i] > V(b)[i] and V(b)[j] > V(a)[j] — neither vector dominates. The cost is O(N) per timestamp. The reduction in cost from O(N) (vector clocks) to O(1) (Lamport) is exactly the reduction in expressiveness from "concurrency detected" to "concurrency invisible".
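The "incomparable" outcome that a single integer cannot express is exactly what the vector comparison produces. A minimal sketch of the dominance test (helper names are mine):

```python
# Minimal vector-clock comparison: the outcome "||" (concurrent) is the
# case a single Lamport integer can never report.
def vc_leq(u, v):
    return all(ui <= vi for ui, vi in zip(u, v))

def relation(u, v):
    if u == v:
        return "="
    if vc_leq(u, v):
        return "->"    # u happened-before v
    if vc_leq(v, u):
        return "<-"    # v happened-before u
    return "||"        # neither dominates: concurrent

a = [2, 0, 0]   # event on P0 that has heard nothing from P1, P2
b = [0, 1, 0]   # event on P1 that has heard nothing from P0, P2
c = [2, 1, 0]   # event on P1 after receiving from P0
assert relation(a, b) == "||"   # concurrent; Lamport ints would just say <
assert relation(a, c) == "->"   # genuine causal order
```

A Lamport comparison of a and b would have returned an arbitrary-looking < or >; the vectors make the absence of any causal chain visible.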
Reproduce this on your laptop
# Reproduce the Lamport simulation
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 lamport_sim.py
# Try a more skewed workload — see what happens with mostly-local events
python3 -c "
import lamport_sim
procs = lamport_sim.simulate(n_proc=5, n_steps=200, seed=1)
all_evts = sorted([e for p in procs for e in p.log], key=lambda e: (e[3], e[1]))
print(f'total events: {len(all_evts)}')
print(f'final lts per process: {[p.lts for p in procs]}')
"
# Vary the message ratio in lamport_sim.py — change 0.4 to 0.1 and watch
# Lamport timestamps grow much more slowly (no max-receive jumps).
Where this leads next
This chapter named the Lamport clock as the simplest possible logical-time construction — one integer, three rules, total order consistent with causality. Several follow-on directions extend it.
- Vector clocks — replace the single integer with an N-vector to detect concurrency, not just order it. Used in Dynamo, Riak, and any system that needs to reason about concurrent updates.
- Hybrid logical clocks (HLC) — a 64-bit value composed of a physical-time prefix and a Lamport-style counter suffix. Used by CockroachDB, MongoDB, YugabyteDB to combine wall-clock readability with Lamport-level causality safety.
- Causality and concurrency — the formal treatment of → and ∥ as a partial order, and how protocols use it to decide what to merge, what to discard, and what to flag as conflict.
- Total-order broadcast — the protocol Lamport's 1978 paper builds using these timestamps; the foundation of state-machine replication.
- TrueTime and Spanner's bounded uncertainty — what happens when you push wall-clocks back into the picture with hardware-backed bounds.
The unifying pattern: Lamport timestamps are the partial-order primitive every causality-respecting distributed protocol descends from. Vector clocks add concurrency detection. HLCs add a wall-clock prefix for human readability. TrueTime adds a hardware-bounded uncertainty interval. Each layer adds capability and cost. Lamport's three-line construction is the floor — the minimum you need to talk about order in a distributed system without lying about it.
References
- Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" — CACM 1978. The foundational paper. Read it; it's only 8 pages and the argument is elegant.
- Mattern, "Virtual Time and Global States of Distributed Systems" — Parallel and Distributed Algorithms 1989. Generalises Lamport to vector clocks; introduces the concurrency-detection criterion.
- Kulkarni et al., "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" — 2014 tech report introducing HLC. The synthesis between Lamport and physical time.
- Apache ZooKeeper internals — zxid — production example of a Lamport-style construction with epoch + counter structure.
- Kleppmann, Designing Data-Intensive Applications, Chapter 9 — O'Reilly 2017. The "Ordering Guarantees" section's treatment of Lamport timestamps as the ordering primitive.
- Wall clocks and NTP — internal cross-link to the chapter that explains why the wall clock is not the right answer for ordering.
- Schwarz & Mattern, "Detecting causal relationships in distributed computations: In search of the holy grail" — Distributed Computing 1994. The classic survey of the causality-detection problem and the Lamport / vector-clock spectrum.