Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
G-Set, 2P-Set, and OR-Set
PlayDream's "live participants" panel for a fantasy-cricket contest looks innocent — a set of user IDs that have joined this contest. Two replicas in two regions, both accepting joins; gossip every 200 ms. Add Riya's ID on Mumbai-edge while a partition isolates Bangalore-edge — when they reconnect, the union of both sets is the right answer. Easy. Now allow leaves: Riya leaves on Bangalore the same wall-clock second she rejoins on Mumbai. After gossip heals, what should the set contain? "Union" is wrong (she'd be present even though she chose to leave); "intersection" is wrong (a brand-new join on one side disappears); "last writer wins by timestamp" is wrong (clocks lie, and Bangalore's wall-clock might be 9 ms behind Mumbai's). The naive set CRDTs — G-Set and 2P-Set — each pick a different wrong answer. The OR-Set (observed-remove set) is the first design that gets it right, and the price is one extra bit of metadata per element: a unique tag.
A G-Set is a grow-only set; merge is union; you can never remove. A 2P-Set pairs an A (added) and an R (removed) G-Set; you can remove, but only an element you've already seen, and once removed it can never be re-added. An OR-Set tags every add with a unique identifier and removes only the tags it has observed; an element is "in the set" iff at least one of its tags has not been tombstoned. The OR-Set is the first set CRDT where add-wins concurrent semantics fall out cleanly, at the cost of unbounded tag growth in long-lived collaborative state.
The grow-only set: union as the join
Start with the simplest set CRDT. A G-Set stores a Python set and supports two operations: add(e) mutates the local set, and merge(other) does self.state |= other.state. The lattice is (P(V), ⊆) — the powerset of the universe V ordered by subset inclusion — and union is the join (least upper bound). Three laws fall out of set union:
- Idempotent:
S ∪ S = S. Re-receiving the same gossip changes nothing. - Commutative:
S ∪ T = T ∪ S. The order in which gossip arrives from peers does not matter. - Associative:
(S ∪ T) ∪ U = S ∪ (T ∪ U). Grouping does not matter.
Why these three suffice for convergence: Shapiro's 2011 CRDT theorem says any state-based replicated structure where the merge is the join of a join-semilattice converges, regardless of message ordering, duplication, or delay. G-Set sits in (P(V), ⊆, ∪) — a textbook semilattice — so the convergence proof is one line of algebra, not a protocol.
The G-Set is genuinely useful where the application semantics never need removal. PlayDream's "users who have ever played any contest this season" set is grow-only by definition — once you've played, the historical fact remains. CricStream's "edges that have served at least one viewer in this match" is grow-only. PaySetu's "merchant IDs that have processed at least one transaction since launch" is grow-only. Bloom filters and HyperLogLog sketches — both probabilistic structures — are essentially grow-only sets in disguise, and both merge by union.
The catch: many real workloads need removal. Group-chat membership (members leave). Cart contents (items get removed). Friend lists (someone unfriends). Open file handles (files get closed). For every such application, the G-Set is wrong by construction, and the natural next attempt is the 2P-Set.
2P-Set: tombstones, and the impossibility of re-adding
The two-phase set tries the same trick the PN-Counter pulled: pair two grow-only structures. Keep two G-Sets — A for added elements and R for removed elements (tombstones). The visible set is A \ R (elements added but not removed). Add inserts into A; remove inserts into R only if the element is currently in A (you cannot tombstone what you have not seen). Merge unions both A and R independently. The lattice is (P(V) × P(V), ⊆ × ⊆) with componentwise union as join — still a semilattice, still convergent.
The 2P-Set works for any application that adds and removes each element at most once. Permission grants and revocations: once revoked, a permission cannot be re-granted (you'd issue a new permission with a fresh identifier). Email-list unsubscribes. Lifetime-banned user IDs. The "phase one then phase two" name is literal: every element traverses A→R at most once, then is gone forever.
Why 2P-Set cannot re-add a removed element: the lattice formalism requires the merge to be monotonically growing. If a removed element could be re-added, the visible set would shrink-then-grow — that is not a monotone trajectory through (P(V), ⊆), so it is not expressible as a join of grow-only states. Any structure that allows "remove then re-add" must encode the re-add as a new element, distinct from the previous one. The OR-Set does exactly this with unique tags.
A 2P-Set group-chat membership goes wrong like this: Riya joins, Riya leaves, Riya wants to re-join. The 2P-Set has Riya in both A and R, so A \ R excludes her permanently. The application has to issue a fresh user ID for the re-join — which the user does not understand, because they are still themselves. This pathology is why 2P-Sets show up in the literature and almost never in production code.
A subtler 2P-Set pitfall: the size of R grows monotonically forever. In a chat application that sees 1M total joins and 1M total leaves over a year, the 2P-Set's state on each replica is 2M tombstoned IDs even though the visible set is empty. Tombstone GC requires coordination (you cannot drop an R entry without proving every replica has seen it), which defeats the no-coordination property the CRDT is meant to give you.
OR-Set: tag every add, remove only observed tags
The Observed-Remove Set is Shapiro's elegant fix. Every add(e) produces a unique tag t (a UUID, or a (replica_id, monotonic_counter) pair) and inserts the pair (e, t) into a grow-only set E. The visible set is the projection { e : (e, t) ∈ E for some t } — every distinct first-component appears in the visible set. Remove(e) on a replica scans E for all pairs whose first component is e that the replica has currently observed, and inserts each such tag into a grow-only tombstone set R. The visible set after remove is { e : ∃t such that (e, t) ∈ E and t ∉ R }.
The trick: a concurrent add(e) on a different replica creates a new tag, which the local remove(e) could not have seen and therefore could not have tombstoned. After gossip, the new tag survives, and e is back in the visible set. This gives add-wins semantics: when add and remove are concurrent, add wins. The OR-Set is the canonical add-wins set CRDT; its variant the Remove-Wins Set (RW-Set) flips the bias.
A runnable G-Set, 2P-Set, and OR-Set
The simulator below implements all three CRDTs in pure Python and runs the same scenario — two replicas, concurrent add and remove of the same element across a partition — on each. The output makes the divergence between the three designs concrete.
# set_crdts.py — G-Set, 2P-Set, OR-Set with the same partition+heal scenario
from dataclasses import dataclass, field
from typing import Set, Tuple, Hashable, Iterable
import uuid, copy
@dataclass
class GSet:
state: Set[Hashable] = field(default_factory=set)
def add(self, e): self.state.add(e)
def remove(self, e): raise NotImplementedError("G-Set is grow-only")
def value(self) -> Set: return set(self.state)
def merge(self, other: "GSet"): self.state |= other.state
@dataclass
class TwoPSet:
A: Set[Hashable] = field(default_factory=set) # added
R: Set[Hashable] = field(default_factory=set) # removed (tombstones)
def add(self, e): self.A.add(e)
def remove(self, e):
if e in self.A: self.R.add(e)
def value(self) -> Set: return self.A - self.R
def merge(self, other: "TwoPSet"):
self.A |= other.A
self.R |= other.R
@dataclass
class ORSet:
E: Set[Tuple[Hashable, str]] = field(default_factory=set) # (element, tag)
R: Set[str] = field(default_factory=set) # tombstoned tags
def add(self, e) -> str:
tag = uuid.uuid4().hex[:8]
self.E.add((e, tag))
return tag
def remove(self, e):
observed_tags = {t for (x, t) in self.E if x == e and t not in self.R}
self.R |= observed_tags
def value(self) -> Set:
return {e for (e, t) in self.E if t not in self.R}
def merge(self, other: "ORSet"):
self.E |= other.E
self.R |= other.R
def scenario(make):
A, B = make(), make()
A.add("riya") # t=1: A adds riya
B.merge(copy.deepcopy(A)) # gossip: B sees riya
# Partition starts. A removes; B concurrently re-adds.
try: A.remove("riya") # t=2 on A: remove (where supported)
except NotImplementedError: pass
B.add("riya") # t=3 on B: concurrent add
# Heal — bidirectional gossip.
A_snap, B_snap = copy.deepcopy(A), copy.deepcopy(B)
A.merge(B_snap); B.merge(A_snap)
return A.value(), B.value()
if __name__ == "__main__":
print("G-Set :", scenario(GSet)) # remove not supported; both end with {riya}
print("2P-Set :", scenario(TwoPSet)) # tombstone permanent; both end with {} or {riya}
print("OR-Set :", scenario(ORSet)) # add wins; both end with {riya}
Sample run:
G-Set : ({'riya'}, {'riya'})
2P-Set : (set(), set())
OR-Set : ({'riya'}, {'riya'})
Walk through the divergence:
- G-Set's answer is
{riya}— the only operation it can express is union; the user's "leave" call raisedNotImplementedErrorand was silently swallowed by thetry/except. This is technically convergent, but it's the wrong answer for any application where leaves are real. - 2P-Set's answer is
{}— onceriyawas added toAand then toRon replica A, the tombstone is permanent. Replica B's concurrent re-add insertsriyaintoB.A, gossip unions bothAs and bothRs, andA − R = {} ∪ {riya} − {riya} = {}. The re-add is silently lost. Why the loss is not a bug but a definitional property:2P-Setoperates on bare elements (no per-add tag), so the re-add cannot be distinguished from a duplicate of the original. The lattice has no room to encode "this is a fresh add, not the same one I already tombstoned" — the only way to express that is to lift the structure to (element, tag) pairs, which is exactly the OR-Set. - OR-Set's answer is
{riya}— replica A'sremoveonly tombstoned the tag it had observed (t1, created during the initial gossip). Replica B's concurrentadd("riya")minted a fresh tagt2that A never saw, sot2is not inR. After merge,(riya, t2)exists inEwith no matching tombstone, soriyais in the visible set. Add wins. And critically — the answer would be the same regardless of message-arrival order, because union ofEand union ofRare both commutative.
Run the OR-Set scenario a hundred times with shuffled gossip order, and every run gives {riya}. That is the meaning of "convergence by construction".
Production stories: who actually uses these
Riak's riak_dt_orset is a production OR-Set with one sharp refinement: tags are (replica_id, dot) pairs where dot is a per-replica monotonic counter, not a UUID. This compresses the wire format — instead of 16-byte UUIDs, tags are (replica_id, int) pairs that delta-CRDT encoding can run-length-compress. Riak's "DVV-Sets" (Dotted Version Vector Sets) push this further by representing the entire E as a per-replica dotted clock, dropping bytes-per-element from O(elements) to O(replicas).
Akka Distributed Data's ORSet ships add-wins semantics with the same dotted-version-vector approach. Akka's documentation explicitly calls out the tombstone-explosion failure mode: a long-lived OR-Set that has seen 1M adds and 1M removes carries 1M tombstones forever, even when the visible set is small. Akka's mitigation is ORSetKey with a TTL — applications that don't need full add-wins can opt for a less general but bounded-state variant.
Yjs and Automerge — collaborative-editing CRDTs used for shared documents — use OR-Set-like structures for their shared-list and shared-set primitives. When two Google-Docs-style users concurrently click "delete this paragraph" and "edit this paragraph", the OR-Set semantics ensure the edit wins (because the edit's add carries a tag the delete never saw, mirroring the riya-rejoin example). This is why Google Docs, Notion, and Linear-style tools converge intuitively under concurrent edits — the underlying data structure encodes "the user's intent to add survives concurrent attempts to remove".
MealRush's "items in cart" CRDT. MealRush — fictional food-delivery startup — initially used a single Postgres table with cart_id, item_id, deleted_at for cart contents, replicated cross-region with logical replication. During a Diwali promotion, two regions accepted the same user's "add biryani" and "remove biryani" within 250ms of each other; the wall-clock-based LWW logic dropped the user's intended add 11% of the time, and the support team logged a "phantom item disappearance" pattern that took weeks to diagnose. The fix was an OR-Set keyed by (item_id, region_local_dot); the phantom-disappearance rate dropped to 0%. The cost was tombstone growth — handled by a daily compaction job that drops tombstones older than 7 days, which is safe because cross-region gossip converges in under 10 seconds even during the worst incidents on record.
KapitalKite's watchlist. Users add and remove stock tickers from a watchlist; the watchlist must be immediately visible across web, iOS, and Android clients without a round-trip to a central server (otherwise the UI feels laggy). KapitalKite — fictional stockbroker — implemented the watchlist as a per-user OR-Set, with the iOS/Android/web clients each acting as a replica plus a server-side replica that handles gossip. Adding "TICKER42" on web while concurrently removing it on mobile resolves to "TICKER42 is in the watchlist" (add-wins), which is exactly what users expect. The team initially shipped a 2P-Set and got bug reports of "tickers won't come back after I remove them"; switching to OR-Set fixed the bug class outright.
Common confusions
- "OR-Set is just LWW with extra steps." It is not. LWW (last-writer-wins) compares wall-clock timestamps and picks one writer; under concurrent operations with skewed clocks, LWW silently picks the wrong one. OR-Set never compares timestamps — its add-wins property is a structural consequence of the tag scheme (a fresh tag from a concurrent add cannot have been tombstoned by anyone who didn't see it). LWW gives "the most recent write wins"; OR-Set gives "every concurrent add survives every concurrent remove that didn't observe it".
- "2P-Set is the same as OR-Set with one tag per element." Subtle but wrong. The defining property of OR-Set is that each
add(e)produces a fresh tag, even ifewas added before. 2P-Set has no per-add identity, so all adds of the same element are indistinguishable; the tombstone applies to the bare element and kills any future add. OR-Set's re-add creates a new (element, tag) pair that is structurally distinct from the tombstoned tags. - "You can implement OR-Set without UUIDs by using just timestamps." Only if your timestamps are guaranteed unique across replicas (which requires either a logical clock plus replica ID, like a
(replica_id, dot)pair, or a hybrid logical clock). Pure wall-clock timestamps collide across replicas in the same millisecond and break add-wins. The "UUID per add" version is correct but wasteful; the "(replica_id, dot)" version is the production-grade compression. - "OR-Set is unbounded, so it cannot be used in long-lived state." Tombstone growth is real, but the rate is bounded by the rate of removes, not the rate of adds. Steady-state applications where the visible set churns (chat memberships, cart items, watchlists) need a tombstone GC strategy: either a TTL-based local cleanup that runs after gossip is known to have converged, or an explicit "stable timestamp" coordination round that reclaims tombstones older than a threshold. Production CRDT databases (Riak, Akka) ship both options.
- "G-Set is useless because real applications need removal." Plenty of applications never remove. Distributed dedup sets ("packets I've already processed"), append-only audit logs treated as a set, content-addressed-storage manifests, "merchants who have ever transacted", "edges that have ever served traffic". The grow-only constraint is the right model when the application's invariant is monotonic.
- "OR-Set's add-wins is a tie-break rule for concurrent operations." It is not a tie-break — there is no comparison or ordering between concurrent operations. It is a structural property: a remove can only tombstone tags it has observed, and a concurrent add creates a tag it could not have observed. The asymmetry is baked into the data structure, not bolted on as a policy.
Going deeper
The (P(V × T), ⊆) lattice and why tags work
OR-Set's state space is a pair (E, R) where E ⊆ V × T (element-tag pairs) and R ⊆ T (tombstoned tags). Both components live in powerset lattices ordered by inclusion, and the join is componentwise union — so the joint structure is the product (P(V × T), ⊆) × (P(T), ⊆) with componentwise join. The visible-set projection π(E, R) = { e : ∃t with (e, t) ∈ E and t ∉ R } is not monotone in (E, R) in general — a remove adds tags to R, which can shrink the visible set. But the underlying state (E, R) is monotone, so convergence holds at the state level; the visible set is just a query that shrinks and grows as the underlying lattice grows monotonically. Why this distinction matters for proofs: textbook CRDT proofs require the state to live in a join-semilattice and be monotone. The visible set is allowed to oscillate as long as the underlying state grows. OR-Set is precisely this trick — encode a non-monotone observable (set membership with removal) as a query over a monotone state.
Delta-state OR-Sets and the dotted-version-vector trick
The naive OR-Set ships the entire E and R on every gossip round, which is O(|adds + removes|). For a 1000-replica cluster gossiping every 100 ms, with each replica seeing 10 adds/sec, the wire-byte cost is O(10⁶) per replica per round. Almeida, Shoker, and Baquero's 2014 delta-CRDT paper introduced dotted-version-vector OR-Sets: each replica maintains a per-replica monotonic counter (the "dot"), and the gossip delta is the set of (replica_id, dot, element) triples that have been added since the last sync. Removes ship the set of dots that have been tombstoned, not the original element. A 1000-replica cluster running DVV-OR-Set typically uses 10–100× less bandwidth than the naive OR-Set, because steady-state changes are deltas, not full-state snapshots.
Tombstone GC: the hardest problem in OR-Sets
Reclaiming tombstones requires proof that every replica has observed every tombstoned tag — otherwise dropping the tombstone re-resurrects a removed element. Three production approaches exist:
- TTL-based local cleanup: drop tombstones older than
T_max, whereT_max > T_gossip_convergence_p99 + safety_margin. Riak uses 24h by default; MealRush in the war story above uses 7 days. - Stable timestamps via Raft: a small Raft group periodically broadcasts "every replica has observed every operation up to timestamp t_stable"; tombstones older than
t_stablecan be GC'd. Adds a small coordination cost per epoch but bounds tombstone growth. - Causal stability: track per-replica vector clocks and drop tombstones whose dots are dominated by every replica's vector. The cleanest answer formally; the most expensive in implementation.
None of these are pure CRDT operations (they require either local timing assumptions or coordination), which is why they are layered on top of the lattice rather than being part of it.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 set_crdts.py
# Try the variations:
# - Run scenario(ORSet) 100 times with shuffled gossip order — observe value invariant
# - Modify ORSet.add to use wall-clock timestamps as tags — observe collision under fast adds
# - Build a Remove-Wins variant (RW-Set) — flip the bias: tombstones win over concurrent adds
Where this leads next
OR-Set is the prototype for every more elaborate set CRDT and many map CRDTs:
- LWW-Register and last-writer-wins — when a single value (not a set) needs the same conflict-resolution treatment.
- G-Counter and PN-Counter — the same "pair two grow-only structures" pattern that OR-Set uses for elements/tombstones.
- Conflict-free replicated data types (overview) — the broader family this article fits inside.
- Delta-state CRDTs — the bandwidth optimisation that makes OR-Set affordable in 1000-node meshes.
- State-based vs operation-based CRDTs — both delivery modes apply to OR-Set; op-based saves bandwidth at the cost of needing causal-delivery infrastructure.
Part 13 continues with sequence CRDTs (RGA, Logoot, LSEQ — used by collaborative-editing tools), JSON CRDTs (Automerge, Yjs), and the formal limits of what CRDTs cannot model (no general set difference, no division for counters, no max-of-incomparable-sets). Part 17's geo-distribution chapters revisit OR-Set under cross-region partitions, where convergence-time analysis becomes far less benign than the LAN case shown here.
References
- Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. — "Conflict-free Replicated Data Types" (SSS 2011). The OR-Set is introduced in §3.5; G-Set and 2P-Set in §3.1–3.2.
- Bieniusa, A., Zawirski, M., Preguiça, N., Shapiro, M., Baquero, C., Balegas, V., Duarte, S. — "Brief Announcement: Semantics of Eventually Consistent Replicated Sets" (DISC 2012). The formal treatment of add-wins vs remove-wins set semantics.
- Almeida, P., Shoker, A., Baquero, C. — "Delta State Replicated Data Types" (JPDC 2014). The optimisation that makes OR-Set bandwidth-competitive with operation-based variants.
- Preguiça, N., Baquero, C., Almeida, P., Fonte, V., Gonçalves, R. — "Dotted Version Vectors: Logical Clocks for Optimistic Replication" (arXiv 2010). The DVV trick used by Riak's
orsetto compress tags. - Riak Data Types documentation —
riak_dt_orsetproduction implementation; tombstone GC and the dotted-version-vector encoding. - Akka Distributed Data documentation — the production
ORSetwith delta-CRDT support and TTL-based cleanup. - Kleppmann, M., Howard, H. — "Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases" (arXiv 2020). Why OR-Set's correctness assumes non-Byzantine replicas.
- G-Counter and PN-Counter — the simpler precursor that OR-Set's design pattern generalises.