Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Meta: scaling the social graph
It is 19:00 IST on a Saturday and Aanya, a CricStream user, opens an app to comment on a friend's post about a cricket final. The app needs, in the time it takes her thumb to settle on the screen: who her friend is, what the post says, the last fifty comments, which of those commenters Aanya knows, the like-count, whether Aanya already liked the post, and whether her friend's privacy settings let Aanya comment at all. Eight or nine reads from the social graph, every one of them a join across friends-of-friends, all in under 100ms. No relational join planner survives this. No primary-replica MySQL at the friend-list scale of a billion users survives it. The reason Meta's stack looks the way it does — and looks different from Amazon's cells and Google's vertical integration — is that this single workload, read-mostly graph traversal at extreme cardinality, drove every architectural decision from the storage layer up to the regional topology.
Meta's distributed stack is built around TAO — the social-graph cache layer — which serves graph reads (who-likes-what, who-friends-whom, what-comments-on-what) at billions of queries per second. TAO sits in front of a sharded MySQL data plane and exploits two facts: graph reads are vastly more frequent than writes, and stale-by-seconds answers are acceptable for nearly every read. The stack achieves this with a regional-leader pattern, single-master writes per shard, and a deliberate choice of eventual consistency that lets caches in every region serve fresh-enough data without coordinating on each read. This chapter walks through the graph data model, TAO's cache hierarchy, the regional-leader topology, and the consistency trade-offs that make the whole thing tractable.
The social graph as a data model
The first thing to understand about Meta's stack is that everything is a graph object or a graph association — and that this single representation choice cascades into every other design decision. A user is an object. A post is an object. A photo is an object. A "user X likes post Y" is an association. A "user X is friends with user Y" is an association. A "comment C is on post P" is an association. The schema for objects is sparse — different object types have different fields — but the schema for associations is uniform: a typed edge from one object to another, with a small payload and a timestamp.
This uniformity is the reason TAO can be a single service that handles every kind of read. There are no per-feature caches, no per-feature query planners, no per-feature replication strategies. There is one small read API — assoc_range(id, type, offset, limit), assoc_count(id, type), and obj_get(id) — and every product feature is a composition of those calls, which is what makes a single cache layer (TAO) sufficient for the entire product surface.
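To make "a composition of those calls" concrete, here is a minimal sketch of the opening scene's comment view built on a dict-backed stand-in for that read API. The stub, the ids, and the edge-type names ("comment", "liked_by") are illustrative, not Meta's actual schema.
# Minimal sketch: the comment view composed from the read API above.
# InMemoryTao exists only so the example runs; in production these calls
# would hit the follower cache tier.
class InMemoryTao:
    def __init__(self, objects, assocs):
        self.objects = objects           # id -> dict of fields (sparse per-type schema)
        self.assocs = assocs             # (from_id, edge_type) -> list of (to_id, payload)

    def obj_get(self, object_id):
        return self.objects[object_id]

    def assoc_range(self, from_id, edge_type, offset, limit):
        return self.assocs.get((from_id, edge_type), [])[offset:offset + limit]

    def assoc_count(self, from_id, edge_type):
        return len(self.assocs.get((from_id, edge_type), []))

def render_comment_view(tao, viewer_id, post_id):
    post = tao.obj_get(post_id)                               # what the post says
    author = tao.obj_get(post["author_id"])                   # who the friend is
    comments = tao.assoc_range(post_id, "comment", 0, 50)     # the last fifty comments
    like_count = tao.assoc_count(post_id, "liked_by")         # the like count
    viewer_liked = any(to == viewer_id for to, _ in           # did the viewer already like it?
                       tao.assoc_range(post_id, "liked_by", 0, like_count))
    return {"author": author["name"], "text": post["text"], "comments": comments,
            "likes": like_count, "viewer_liked": viewer_liked}

tao = InMemoryTao(
    objects={1: {"name": "Rohan"}, 2: {"name": "Aanya"},
             100: {"author_id": 1, "text": "what a final!"}},
    assocs={(100, "comment"): [(2, "incredible over")],
            (100, "liked_by"): [(2, "")]})
print(render_comment_view(tao, viewer_id=2, post_id=100))
A real render would also check the privacy edges from the opening scene, but each check is just one more read against the same three-call API.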
Why a uniform association model beats per-feature schemas: a relational schema with friends, likes, and comments as separate tables forces the application to know which table to query for which feature. Adding a new feature (say "saved posts") means adding a new table, a new query path, and a new caching strategy. The graph model trades a slightly fatter per-row representation for the property that any new feature is just a new edge type — no schema migration, no new cache, no new query planner. At Meta's release cadence (multiple feature ships per week, hundreds of teams), this is the difference between a stack that scales with feature count and a stack that doesn't.
The objects and associations are stored in sharded MySQL. The shard key is the object id for objects, and the from-id of the edge for associations — which means all edges out of a given user (their friend list, their like list, their post list) sit on the same shard. This co-location is what makes assoc_range(aanya, friends, 0, 200) a single-shard query rather than a fan-out. Edges into a user (who-friended-aanya, who-liked-aanya's-photo) require a separate index and live on a different shard, but for the friends edge type — which is symmetric — Meta stores both directions to keep both queries single-shard. The duplication is paid for in storage; the simplicity is paid back at every read.
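A minimal sketch of that shard-routing rule, under an assumed shard count and placement function (real shard counts and placement are operational details the text doesn't specify): every outgoing edge of a user lands on that user's shard, and the symmetric friends edge is written once in each direction so both friend lists stay single-shard reads.
# Sketch of shard routing under assumed placement. Objects shard by their own
# id; associations shard by the from-id of the edge, regardless of edge type,
# so all of a user's outgoing edges are co-located with the user object.
NUM_SHARDS = 4096  # assumed for illustration

def shard_for_object(object_id: int) -> int:
    return object_id % NUM_SHARDS

def shard_for_assoc(from_id: int, edge_type: str) -> int:
    return from_id % NUM_SHARDS          # edge type does not affect placement

def add_friendship(route_write, a: int, b: int) -> None:
    # The friends edge is symmetric, so it is stored in both directions:
    # one write on a's shard, one on b's. Storage is duplicated; every
    # friend-list read stays single-shard.
    route_write(shard_for_assoc(a, "friends"), ("assoc_add", a, "friends", b))
    route_write(shard_for_assoc(b, "friends"), ("assoc_add", b, "friends", a))

writes = []
add_friendship(lambda shard, op: writes.append((shard, op)), a=17, b=42_000_017)
print(writes)   # two writes, landing on two different shards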
TAO: the read-through cache for the entire graph
The MySQL data plane could not, on its own, serve the read traffic. A timeline render pulls dozens of posts, each needing the eight or nine graph reads from the opening scene; multiply by 3 billion daily active users and tens of renders per user per day, and it comes out to tens of trillions of graph reads per day. TAO's job is to absorb every single one of those reads, with MySQL only seeing the cache misses. The numbers Meta has published put the read-to-write ratio in the social graph at roughly 500:1, and TAO's cache hit rate above 99% — so MySQL sees less than 1% of the read load and almost all of the write load.
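A back-of-envelope version of that arithmetic, with the per-render and per-day numbers spelled out; all four inputs are assumptions chosen to match the text's rough figures, not published measurements.
# Back-of-envelope check of the read volume claimed above. Every input is an
# assumption chosen to match the text's rough figures.
daily_active_users = 3_000_000_000
renders_per_user_per_day = 20          # "tens of timeline renders" per user per day
posts_per_render = 25                  # a feed render pulls dozens of posts
graph_reads_per_post = 9               # the eight-or-nine reads from the opening scene

reads_per_day = (daily_active_users * renders_per_user_per_day
                 * posts_per_render * graph_reads_per_post)
print(f"graph reads/day ≈ {reads_per_day:.1e}")                    # ≈ 1.4e13, tens of trillions
print(f"MySQL reads/day at a 99% TAO hit rate ≈ {reads_per_day * 0.01:.1e}")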
TAO is structured as a two-tier cache: a leader tier per shard (one or a small replica set, co-located with the MySQL master for that shard) and a follower tier spread across every datacenter. A read goes to the local follower; on a miss, the follower asks its leader; on a leader miss, the leader reads from MySQL. Writes always go to the leader, which writes through to MySQL synchronously and then invalidates or refills the follower caches asynchronously.
The leader tier is what makes single-master writes per shard possible at Meta's scale. The leader is a thin caching layer with strong-enough consistency for its shard — every write for shard 42 goes through the same leader, the leader applies them in order, and the leader is the source of truth for the cache state of that shard. Why have a leader at all rather than letting followers read straight from MySQL: the leader collapses a thundering herd. When a popular post is created and 50 datacenters each have 1000 followers that miss on it simultaneously, without a leader you'd hit MySQL with 50,000 reads of the same row. With a leader, each region's followers ask their region's leader, which deduplicates the reads at the leader level, and only the home-region leader actually goes to MySQL — and even then with a single read. The leader tier exists primarily for read-coalescing and write-serialisation, not for cache hit rate.
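The read-coalescing behaviour described above can be sketched as a single-flight cache: the first miss for a key does the storage read, and every concurrent miss for the same key waits for that result. The threading, the in-memory dict, and the timings here are illustrative, not TAO's actual implementation.
# Single-flight read coalescing: 100 concurrent follower misses on the same
# key produce one storage read. Illustrative only.
import threading
import time

class CoalescingLeader:
    def __init__(self, storage_read):
        self.storage_read = storage_read
        self.cache = {}
        self.inflight = {}             # key -> Event signalled when the read completes
        self.lock = threading.Lock()
        self.storage_reads = 0

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]
            event = self.inflight.get(key)
            first_miss = event is None
            if first_miss:
                event = threading.Event()
                self.inflight[key] = event
        if first_miss:
            value = self.storage_read(key)         # only one caller reaches storage
            with self.lock:
                self.cache[key] = value
                self.storage_reads += 1
                del self.inflight[key]
            event.set()
            return value
        event.wait()                               # everyone else piggybacks on that read
        with self.lock:
            return self.cache[key]

def slow_mysql_read(key):
    time.sleep(0.05)                               # pretend this is the home-region MySQL master
    return f"row-for-{key}"

leader = CoalescingLeader(slow_mysql_read)
threads = [threading.Thread(target=leader.get, args=("popular-post",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("storage reads for 100 concurrent follower misses:", leader.storage_reads)   # -> 1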
Cache hit rate is dominated by the follower tier, which is huge — many terabytes per region — and warm because the social graph has heavy access locality (your friends' content is reread every time you scroll, and your friends are a small fraction of all users globally). A typical follower cache holds the working set for the users currently active in that region, with eviction by approximate-LRU, and a hit rate north of 99% on most shards.
Regional leaders and the single-master write path
Every shard has one home region — the region whose datacenter owns the MySQL master for that shard. Writes for that shard must go to that region. A user in Region B writing a comment on a shard-42 post sends the write to the home-region leader in Region A over the WAN; the home-region leader writes through to the MySQL master, and invalidations then fan out asynchronously to all other regions.
This is the single-master per-shard pattern at planet scale. It is not the same as Spanner's globally-replicated transactions: TAO does not coordinate writes across shards (no two-phase commit, no global timestamps). It is not the same as Dynamo-style per-region masters either: there is exactly one master per shard worldwide, not one per region. The pattern only works because the social graph's writes are dominated by single-shard operations — a like is one edge, a comment is one edge plus one object, an accepted friendship is one edge in each direction (handled as two writes, which can look inconsistent for a brief window). Cross-shard transactions (the rare "transfer X from user A to user B") are not supported in TAO; the application must layer those on top with its own retry-and-reconcile logic.
The cost of single-master writes is WAN write latency. Aanya in Mumbai writing a comment on a shard whose home is in California pays the ~200ms round-trip on the write. The alternative would be multi-master per-shard with cross-region conflict resolution, which the design deliberately rejected: the conflict-resolution semantics for "two users liked the same post from different regions" are benign, but the semantics for "two users renamed the same group simultaneously" are not — and exposing those semantics to product engineers is a tax on every feature.
The compromise Meta arrived at is read-your-writes within a region via a follower-leader cache update protocol. After Aanya's write returns successfully from the home-region leader, her local follower in Mumbai is immediately updated with the new value, even though MySQL replication and other regions' invalidates are still in flight. The next read from Aanya's session — even if it's served by her local follower — will see her own write. Other users in other regions may not see it for another second or two. This is a session-scoped consistency contract: read-your-writes for the writer, eventual consistency for everyone else.
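A sketch of that write path under an assumed two-region setup: the write goes over the WAN to the home-region leader, which writes through to its MySQL master; the writer's local follower is updated synchronously before the call returns, while the invalidations to other regions are still in flight. Region names, keys, and the in-memory stand-ins are illustrative.
# Write-through at the home-region leader plus a synchronous local-follower
# update — the step that makes read-your-writes real for the writer.
class Follower:
    def __init__(self):
        self.cache = {}
    def read(self, key, leader):
        if key not in self.cache:
            self.cache[key] = leader.read(key)       # miss: go to the leader
        return self.cache[key]
    def apply(self, key, value):
        self.cache[key] = value                      # synchronous post-write update

class HomeRegionLeader:
    def __init__(self):
        self.mysql = {}                              # stand-in for the shard's MySQL master
        self.pending_invalidations = []              # would fan out to other regions asynchronously
    def write(self, key, value, remote_followers):
        self.mysql[key] = value                      # write-through: durable before returning
        self.pending_invalidations += [(f, key) for f in remote_followers]
        return "ok"
    def read(self, key):
        return self.mysql.get(key)

leader = HomeRegionLeader()                          # home region for shard 42
leader.mysql["post:8821:comments"] = []              # the post exists, no comments yet
mumbai, oregon = Follower(), Follower()
oregon.read("post:8821:comments", leader)            # the friend's region has the old value cached

# Aanya (Mumbai) writes a comment: WAN write to the home leader, then her local
# follower is updated before the call returns.
leader.write("post:8821:comments", ["great catch!"], remote_followers=[oregon])
mumbai.apply("post:8821:comments", ["great catch!"])

print("Aanya re-reads in Mumbai:", mumbai.read("post:8821:comments", leader))
print("Friend reads in Oregon (invalidation still in flight):",
      oregon.read("post:8821:comments", leader))     # stale until the invalidation lands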
# tao_read_path_simulation.py
# Simulate TAO's three-layer read path with follower hits, leader hits, and MySQL fallthroughs.
# This shows where the cache hit rate comes from and what happens on a popular-post thunder.
import random
import collections

NUM_FOLLOWERS = 100       # follower cache instances in one region (context only; see the walkthrough)
FOLLOWER_HIT_RATE = 0.93  # baseline cache hit at follower
LEADER_HIT_RATE = 0.85    # of follower misses, leader hits this fraction
NUM_REQUESTS = 200_000

random.seed(7)
counts = collections.Counter()
for _ in range(NUM_REQUESTS):
    if random.random() < FOLLOWER_HIT_RATE:
        counts['follower_hit'] += 1
    elif random.random() < LEADER_HIT_RATE:
        counts['leader_hit'] += 1
    else:
        counts['mysql_read'] += 1

total = sum(counts.values())
print(f"Total reads: {total:,}")
print(f"Follower hits: {counts['follower_hit']:,} ({counts['follower_hit']/total*100:.2f}%)")
print(f"Leader hits: {counts['leader_hit']:,} ({counts['leader_hit']/total*100:.2f}%)")
print(f"MySQL fallthrough: {counts['mysql_read']:,} ({counts['mysql_read']/total*100:.2f}%)")
print()

# Effective amplification: how many MySQL reads per million app reads?
mysql_per_million = counts['mysql_read'] / total * 1_000_000
print(f"MySQL reads per 1M app reads: {mysql_per_million:,.0f}")
Sample output on a PaySetu analysis box:
Total reads: 200,000
Follower hits: 186,196 (93.10%)
Leader hits: 11,765 (5.88%)
MySQL fallthrough: 2,039 (1.02%)
MySQL reads per 1M app reads: 10,195
Walkthrough: 93% of reads are absorbed by the follower tier (typical for a warm cache with locality of access). Of the 7% that miss the follower, 85% hit the leader (because the leader is much larger and pools requests across all followers in its region). Only the residual 1% reach MySQL. This 100:1 amplification reduction is what makes the entire stack economically possible — without TAO, Meta would need ~100× the MySQL fleet to serve the same read load. Why this matters more than raw cache hit rate: the leader tier is what catches the thundering herd — a popular post being read simultaneously by every follower in a region. Without the leader, each follower miss would be an independent MySQL hit; with the leader, the 100 follower misses are coalesced into 1 leader-to-MySQL read. The leader is not improving the steady-state hit rate much; it's improving the worst-case fanout, which is what dominates incident-time MySQL load.
The consistency model — eventual, but bounded
TAO's consistency model is the most-misunderstood part of the design, because the casual description "eventually consistent" hides several different guarantees layered on top of each other.
Within a single shard, the leader serialises writes — so two writes to the same shard are linearizable as seen by any single leader-reading client. Across shards, writes are not coordinated; the order in which "Aanya liked Post 8821" and "Rohan liked Photo 4422" become visible depends on which region you ask. Within a session (a single user's reads), the follower-update-on-write protocol gives read-your-writes — Aanya always sees her own writes immediately. Across sessions, reads are eventually consistent with bounded staleness — typically under 1 second worldwide, longer during cross-region replication incidents.
The product implication is that the application must be designed for stale reads to be tolerable. A like count that says 47 when the truth is 48 is fine. A friend list that's missing a friend who was added 200ms ago is fine. A comment that doesn't appear for half a second on the friend's screen is fine. The rare cases where stale reads aren't fine — say, a user blocking another user, where the block must take effect immediately — are handled with explicit consistency escapes: the application can request a "consistent read" that bypasses follower caches and reads directly from the home-region leader. These escapes are rare (well under 1% of reads) but they are the safety valve for the feature surfaces where eventual consistency is unsafe.
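A sketch of that escape hatch: an explicit flag on the read path that routes the lookup to the home-region leader instead of the local follower. The flag, the helper, and the block-check example are hypothetical; the contract they illustrate is the one described above.
# Hypothetical consistency escape: normal reads hit the local follower and may
# be seconds stale; critical reads (block enforcement) go to the home-region leader.
class GraphReader:
    """Stand-in for one cache tier: (from_id, edge_type) -> list of to_ids."""
    def __init__(self, assocs):
        self.assocs = assocs
    def assoc_range(self, from_id, edge_type, offset, limit):
        return self.assocs.get((from_id, edge_type), [])[offset:offset + limit]

def is_blocked(local_follower, home_leader, author_id, viewer_id, critical):
    tier = home_leader if critical else local_follower   # the escape is a per-call decision
    return viewer_id in tier.assoc_range(author_id, "blocks", 0, 10_000)

# The block was just written: the home-region leader has it, the local follower doesn't yet.
local_follower = GraphReader({(1, "blocks"): []})
home_leader = GraphReader({(1, "blocks"): [2]})
print("follower (stale) says blocked:", is_blocked(local_follower, home_leader, 1, 2, critical=False))
print("leader (fresh) says blocked:  ", is_blocked(local_follower, home_leader, 1, 2, critical=True))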
CricStream, building a comments feature on top of a TAO-equivalent graph cache, would discover this contract the hard way: a comment posted on a friend's match analysis post is visible to the poster instantly (read-your-writes in their region) but appears for the friend in another region a second later. For 99% of users on 99% of comments this is invisible. The 1% case — friend refreshes precisely while replication is lagging — is what generates the occasional support ticket and post-mortem entry.
Common confusions
- "TAO is just memcached" — Meta's early stack did use memcached as the cache layer, and TAO is its successor. The crucial difference is that memcached has no notion of a graph: it is a key-value cache, and applications had to manually shard, look up, and assemble graph data. TAO knows about objects and associations as first-class citizens, supports range queries and counts natively, and enforces the single-master write path. Memcached didn't enforce write ordering; TAO does, per shard.
- "Single-master per shard is a SPOF" — if a shard's home region loses its leader, writes for that shard pause, but reads continue from local followers and from MySQL replicas; reads stall only when the local MySQL replica is also down. Master failover within the home region is fast (seconds) and automated; what is deliberately avoided is cross-region failover of a shard's master role, because that would require a globally consistent shard-to-region mapping.
- "Eventual consistency means data loss is possible" — eventual consistency in TAO means write visibility lags, not write durability. Once the home-region MySQL master has acknowledged the write, the data is durable. The eventual part is when read replicas and follower caches in other regions catch up. A network partition between regions does not lose writes; it only delays read visibility, which resumes when the partition heals.
- "You can replace MySQL with anything" — Meta has experimented with multiple storage backends under TAO, but has stayed on MySQL because the schema is sparse-and-uniform and the per-shard size fits a single MySQL replica's storage budget. Replacing the storage tier means re-implementing the leader-to-storage write protocol, the replica-lag SLOs, and the operational tooling. The data plane is interchangeable in principle and locked in in practice.
- "This is the same as Bigtable" — Bigtable is a distributed key-value store with row-level transactions and column families; TAO is a cache layer with a graph data model and a single-master write protocol. They could in principle be combined (TAO on Bigtable instead of MySQL), but the workloads they are optimised for differ: TAO assumes a ~99%-read load and a graph access pattern, Bigtable assumes a wider range of access patterns including bulk scans and writes.
- "Read-your-writes for the writer is automatic" — it is not; it requires the follower-update-on-write protocol described above. A follower cache that simply waited for the standard invalidation broadcast would show the writer their pre-write data for the seconds it took the invalidation to arrive. The explicit "update the local follower synchronously after the write" step is what makes the contract real, and it is the kind of detail that is easy to miss in a design doc and impossible to retrofit later.
Going deeper
Why the social graph rejected SQL joins
A naïve relational schema would represent friends as a friendships(user_a, user_b) table, and asking "show me friends-of-friends who like cricket" becomes a three-way join: friendships ⋈ friendships ⋈ likes. At graph cardinality (2 billion users averaging 200 friends and 100 likes each), the friendships table alone holds roughly 400 billion rows, and the planner must self-join it, join the result against likes, and rank the output inside an interactive latency budget — a plan no general-purpose MySQL planner executes in tens of milliseconds. The graph model's contribution is to make every "friends-of-friends" query a series of small assoc_range reads from cache, where the application decides the join order and prunes aggressively at each step. The query planner moves from the database into the application, where it can use product-specific knowledge (e.g., "stop after 50 friends of friends" or "rank by a specific feature") that a general-purpose planner can't have.
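A sketch of what "the query planner moves into the application" looks like: the friends-of-friends expansion as a sequence of bounded assoc_range reads with aggressive pruning. The cut-offs and the dict-backed graph are illustrative, and the inner like-check would in practice be a single cached point lookup rather than a scan.
# Application-side "join": friends-of-friends who like a given topic, expanded
# as bounded edge reads with aggressive pruning. Limits are illustrative.
def assoc_range(assocs, from_id, edge_type, offset, limit):
    return assocs.get((from_id, edge_type), [])[offset:offset + limit]

def friends_of_friends_who_like(assocs, user_id, topic_id,
                                max_friends=200, max_candidates=50):
    friends = assoc_range(assocs, user_id, "friends", 0, max_friends)
    direct = set(friends)
    results = []
    for friend in friends:
        for fof in assoc_range(assocs, friend, "friends", 0, max_friends):
            if fof == user_id or fof in direct or fof in results:
                continue                              # prune: self, direct friends, duplicates
            # In production this would be one cached point lookup, not a scan.
            if topic_id in assoc_range(assocs, fof, "likes", 0, 100):
                results.append(fof)
            if len(results) >= max_candidates:
                return results                        # prune: product decides 50 is enough
    return results

graph = {
    ("aanya", "friends"): ["rohan", "meera"],
    ("rohan", "friends"): ["aanya", "vikram"],
    ("meera", "friends"): ["aanya", "vikram", "dev"],
    ("vikram", "likes"): ["cricket"],
    ("dev", "likes"): ["chess"],
}
print(friends_of_friends_who_like(graph, "aanya", "cricket"))   # -> ['vikram']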
Cross-shard writes — the no-go that shaped the whole stack
The decision not to support cross-shard transactions in TAO is the most consequential negative-space decision in the design. It means that "transfer this group's ownership from user A to user B" — which involves rewriting edges on shards owned by both users — has to be handled by the application as two single-shard writes with a reconciliation protocol. Meta has internal frameworks for this (idempotent operation logs, async reconciliation jobs) but they are not transparent to product engineers. The trade-off accepted is that rare cross-shard operations are made harder so that common single-shard operations stay fast and simple. KapitalKite, building a financial system, would not accept this trade — financial transactions need cross-account consistency and would force a real distributed transaction layer. Meta accepted it because the social graph's writes are dominated by likes, comments, and friend operations, all of which are single-shard.
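A sketch of the application-level pattern the paragraph describes for rare cross-shard operations: two single-shard writes plus an idempotent operation log, with a reconciliation pass that replays whatever didn't land. The log format and the reconciler are illustrative, not a specific Meta framework.
# Cross-shard ownership transfer as two single-shard writes plus an idempotent
# operation log; a reconciler replays missing steps, and the op id makes
# replays safe. Illustrative pattern only.
import uuid

op_log = {}                     # op_id -> {"steps": [...], "done": set of completed step indexes}
shards = {"A": {}, "B": {}}     # stand-ins for the two users' shards

def apply_step(step):
    shard, key, value = step
    shards[shard][key] = value                      # idempotent: same value on every replay

def transfer_group_owner(group_id, new_owner, fail_after_first=False):
    op_id = str(uuid.uuid4())
    steps = [("A", f"group:{group_id}:owner", new_owner),   # shard holding the group
             ("B", f"user:{new_owner}:owns", group_id)]     # shard holding the new owner
    op_log[op_id] = {"steps": steps, "done": set()}
    for i, step in enumerate(steps):
        if fail_after_first and i == 1:
            return op_id                            # simulate a crash between the two writes
        apply_step(step)
        op_log[op_id]["done"].add(i)
    return op_id

def reconcile():
    for entry in op_log.values():
        for i, step in enumerate(entry["steps"]):
            if i not in entry["done"]:
                apply_step(step)                    # safe to retry: steps are idempotent
                entry["done"].add(i)

transfer_group_owner("g1", "rohan", fail_after_first=True)
print("before reconcile:", shards)
reconcile()
print("after reconcile: ", shards)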
The TAO paper and what it doesn't say
The 2013 USENIX paper "TAO: Facebook's Distributed Data Store for the Social Graph" is the canonical reference. What the paper documents thoroughly: the data model, the cache hierarchy, the consistency contract. What the paper says little about: how TAO's leader fails over within a region, how the home-region assignment is changed for a shard (it is rare and operationally heavy), how the follower-cache invalidation protocol behaves under regional partitions. The community has had to reverse-engineer these from talks and incident reviews. This is a feature of all production-system papers: the paper describes the design at a moment in time, and the design has continued to evolve.
Comparing TAO with Twitter's Manhattan and Discord's session-storage
Twitter's Manhattan was designed for similar read-heavy social workloads, with a key difference: Manhattan exposes a more general key-value model rather than a graph-native one, so applications above it often rebuild graph-like access on top. Discord's voice-and-chat session-storage uses Cassandra for the storage layer with similar regional-leader patterns for active sessions, but the consistency contract is different (Cassandra's tunable consistency vs MySQL's single-master). Reading these three together is a useful exercise: same workload class, three quite different architectural choices, all defensible.
What changes when you don't have Meta's scale
The TAO architecture is overkill for a service with under, say, 10 million users. A startup like PaySetu doesn't need a two-tier cache and a regional-leader topology — a single MySQL primary with a Redis cache in front is enough until well past their first million users. The point of studying TAO is not to copy it but to understand which decisions in it are workload-driven (the graph model, the read-mostly assumption, the single-master write path) and which are scale-driven (the two-tier cache, the regional-leader pattern). The workload-driven decisions transfer to smaller systems; the scale-driven ones do not.
Where this leads next
The pattern from this chapter — read-mostly graph caching with single-master per-shard writes — sets up the next case studies in Part 20:
- Netflix: resilience culture — Netflix's stack also leans heavily on read-side caches (EVCache) but for a different workload (video catalogue and personalisation) and on AWS rather than custom infrastructure. The contrast is illuminating.
- Discord: BEAM to Rust journey — Discord ran a TAO-equivalent for chat sessions on Erlang's BEAM and migrated parts to Rust. The story is partly about the language choice and partly about the workload pivot from low-latency chat to high-fanout voice.
- Cloudflare: anycast and global load balancing — Cloudflare's stack is the antithesis of single-master per-shard: every PoP is equal, every read is local, and the write path is rare. Reading Meta and Cloudflare back-to-back shows how workload (read-heavy with rare writes vs read-only edge serving) drives architecture.
The thread that ties Meta's choices together is that the social graph's read pattern is so dominant and so localised that it justifies an entire infrastructure layer dedicated to it. TAO is not a generic cache; it is a graph cache with a single-master write path and a session-aware consistency contract. Other companies would not build TAO because their workload doesn't look like the social graph. Meta built it because their workload only looks like the social graph.
References
- Nathan Bronson et al., "TAO: Facebook's Distributed Data Store for the Social Graph" (USENIX ATC 2013) — the canonical paper.
- Rajesh Nishtala et al., "Scaling Memcache at Facebook" (NSDI 2013) — TAO's predecessor and the paper that documents the regional cache patterns TAO inherited.
- Subramanian Muralidhar et al., "f4: Facebook's Warm BLOB Storage System" (OSDI 2014) — the photo-storage layer that sits below the graph and shows the same regional-leader pattern at a different layer.
- Audrey Cheng et al., "RAMP-TAO: Layering Atomic Transactions on Facebook's Online TAO Data Store" (VLDB 2021) — the modern follow-on, layering bounded multi-key transactions on top of TAO.
- Mahesh Balakrishnan et al., "Tango: Distributed Data Structures over a Shared Log" (SOSP 2013) — a Microsoft Research paper that gives the cleanest theoretical framing of the single-shared-log pattern that TAO's per-shard leader implements.
- Jeff Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004) — orthogonal to TAO but referenced because Meta's batch graph processing (offline ranking, recommendation training) reads from a MapReduce-style snapshot of TAO that is itself a major design subsystem.
- See also: Amazon: cells, shuffle-sharding, isolated fates, Google's stack, eventual consistency needs conflict resolution, the append-only log.