Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Netflix: resilience culture
It is a Sunday evening in São Paulo and a viewer hits play on a documentary. In the 400 milliseconds between her finger leaving the screen and the first frame painting, Netflix's edge does roughly forty things across thirty microservices: identifies the device, checks subscription status, consults the personalisation model, picks a CDN edge, fetches the manifest, decides the bitrate ladder, fetches the DRM licence, primes the audio track, and logs every step for the data warehouse. Any one of those services can be slow, return bad data, or simply be gone. What is unusual about Netflix is not that it has built each service to be reliable — every large company tries that. What is unusual is that Netflix has built an organisation in which every team assumes their dependencies will fail, designs around it, and then proves it by deliberately killing things in production. That cultural posture — not Hystrix, not EVCache, not the Simian Army on its own — is what differentiates Netflix's stack from Amazon's cells and Meta's social graph. It is the only one of the three where the binding decision was about engineering culture, not architecture.
Netflix's resilience model rests on three pillars: assume every dependency can fail (so wrap every cross-service call in a circuit breaker with a fallback), prove the system handles failure (so run Chaos Monkey, Chaos Kong, and FIT in production), and decentralise operational ownership (so the team that builds the service is paged when it breaks). The technology — Hystrix, EVCache, the Simian Army, FIT — is widely copied. The culture that makes it work is not.
Assume every dependency will fail
The starting axiom at Netflix is that any service call you make can fail, slowly or completely, at any time, without notice. Not "rarely fails". Not "fails under heavy load". Always assume failure. This sounds like a platitude until you see what it does to code. A team building a "recently watched" rail on the home page would normally write something like: call the user-history service, get the list, render. At Netflix, that same code path looks like: call user-history wrapped in a circuit breaker; if the breaker is open, fall back to a per-user cached snapshot in EVCache; if EVCache is missing the snapshot, fall back to a generic popular-content rail; if even that fails, render the home page without the rail and log a degradation event. Every cross-service call has a bounded latency, a fallback, and a graceful degradation path.
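A minimal sketch of that code path, assuming hypothetical client objects and helper names rather than the real libraries, looks roughly like this:
# recently_watched_rail.py: hypothetical sketch of the four-level degradation path
class CircuitOpen(Exception):
    """Raised by the (hypothetical) wrapped client when its breaker is open."""

def recently_watched_rail(user_id, history_client, evcache, popular_rail, log):
    """Return something renderable for the rail; never propagate an error to the page."""
    try:
        # Level 1: the real answer, behind a circuit breaker with a hard timeout.
        return history_client.get_recently_watched(user_id, timeout_ms=80)
    except (CircuitOpen, TimeoutError):
        pass
    # Level 2: per-user snapshot written to EVCache on earlier successful calls.
    snapshot = evcache.get(f"recent:{user_id}")
    if snapshot is not None:
        log.degradation("recently_watched", level="evcache_snapshot")
        return snapshot
    # Level 3: generic popular-content rail, same shape as the personalised one.
    if popular_rail:
        log.degradation("recently_watched", level="generic_popular")
        return popular_rail
    # Level 4: render the home page without this rail and record the degradation.
    log.degradation("recently_watched", level="rail_omitted")
    return None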
This is not optional discipline applied where it seems important. It is enforced at the platform level by the client library — historically Hystrix, more recently a successor stack — that wraps every outbound RPC. A service that does not wrap its calls in the circuit breaker pattern simply does not pass production review. The library imposes a timeout, a thread-pool budget, and a fallback hook on every dependency. Why a thread-pool budget and not just a timeout: a slow dependency that times out at 1 second still ties up your thread for 1 second. If 200 requests per second arrive while the dependency is slow, you have 200 threads stuck. With a thread-pool budget (say, 20 threads dedicated to that dependency) the 21st request fails fast, the breaker trips, and your service stays responsive even though that one dependency is dying. The thread pool is the bulkhead — it limits the blast radius of one slow dependency to the threads allocated to it, not the whole service.
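The bulkhead is easy to see in miniature if the per-dependency thread pool is modelled as a bounded semaphore. The class, numbers, and names below are illustrative, not the Hystrix implementation:
# bulkhead_sketch.py: illustrative bulkhead, one bounded budget per dependency
import threading

class Bulkhead:
    """Reject immediately when the dependency's budget (e.g. 20 slots) is used up."""
    def __init__(self, max_concurrent=20):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        # Non-blocking acquire: the 21st concurrent caller gets the fallback at once
        # instead of queueing behind a slow dependency for a full timeout.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()   # the actual RPC, still bounded by its own timeout
        except Exception:
            return fallback()
        finally:
            self._slots.release()

# Usage: one Bulkhead per dependency, shared across request threads.
user_history_bulkhead = Bulkhead(max_concurrent=20)
# rail = user_history_bulkhead.call(lambda: fetch_history(uid), lambda: cached_rail(uid))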
The fallback hook is the part that takes engineering taste. A bad fallback is "return null and let the caller deal with it" — that just propagates the failure up the stack. A good fallback is one that returns a sensible answer that keeps the user moving: a cached snapshot from a few hours ago, a generic non-personalised response, a default value that lets the page render. CricStream, building a similar architecture for cricket-final live streams, would learn this the hard way: a "fall back to null" on the personalisation service produces a blank carousel; a "fall back to popular-in-your-region" produces something the user can still click. The user never knows the personalisation service was down.
Prove it by breaking production on purpose
The corollary of "assume every dependency will fail" is that you must verify your fallbacks work. A fallback that has never been exercised is a fallback that has bugs. Netflix's response to this is the famous Simian Army — a fleet of fault-injection tools that deliberately break parts of the running production system and watch what happens.
Chaos Monkey kills a random EC2 instance during business hours. Chaos Gorilla takes down an entire AWS availability zone. Chaos Kong takes down an entire AWS region. Latency Monkey injects artificial delay into RPC calls. Conformity Monkey checks instances against a set of rules and shuts down ones that violate them. All of this runs against the live customer-facing system, not staging. The reason it has to run in production is that staging is never representative — it has the wrong data, the wrong traffic shape, the wrong dependencies, the wrong scale. Failure has to be exercised where it actually hurts, otherwise teams will rationalise their way out of fixing the bug.
The cultural pre-condition for Chaos Monkey is that the team responsible for a service is the team that gets paged when it breaks, and they cannot block fault injection. If a team could opt out, every team would opt out, the chaos tooling would atrophy, and the resilience would erode the moment a new dependency was added. Netflix's leadership made it explicit: you cannot turn off Chaos Monkey for your service. You can shape the schedule, you can choose business hours, you can negotiate the blast radius — but you cannot opt out. Why this organisational rule matters more than the tool itself: the tool is straightforward to build (about 200 lines of code that call the AWS API to terminate instances). The hard part is the social contract that says "yes, our service can be terminated at any time, and we accept the responsibility to make it tolerate that". Without that contract, the tool becomes a curiosity that runs in a sandbox.
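To make that concrete, here is a hedged sketch of the instance-killing core, assuming boto3 credentials are configured and a hypothetical tag marks instances as eligible; it is an illustration of the mechanism, not Netflix's Chaos Monkey:
# chaos_sketch.py: illustrative instance killer (assumes boto3 and a tagging convention)
import random
import boto3
from botocore.exceptions import ClientError

def kill_one_instance(region="us-east-1", dry_run=True):
    ec2 = boto3.client("ec2", region_name=region)
    # Find running instances that carry the (hypothetical) chaos=eligible tag.
    resp = ec2.describe_instances(Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:chaos", "Values": ["eligible"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"]
                 for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        # With DryRun=True, AWS validates permissions without terminating anything.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # AWS signals "would have succeeded" for a dry run via this error code.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim

if __name__ == "__main__":
    print("terminated (or would have):", kill_one_instance(dry_run=True))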
The progression from Chaos Monkey to Chaos Kong is where the cultural depth shows. Chaos Monkey kills an instance — most services tolerate it because they run multiple instances behind a load balancer. Chaos Gorilla kills a zone — services have to actually be deployed across multiple zones, which forces correct stateless service design. Chaos Kong kills a region — at this point the system has to support full regional failover, which means the data layer must be replicated cross-region, the DNS must support fast failover, and the cache must warm up in the surviving region within minutes. Each level forces a different architectural property to actually be present, and the only way to know they are present is to break things and watch.
EVCache, FIT, and the supporting tech
While the culture is the binding constraint, the supporting technology is what makes it survivable.
EVCache is Netflix's distributed memcached-based caching layer. Like Meta's TAO it sits in front of the durable store and absorbs the read load. Unlike TAO, EVCache is designed for AWS as the substrate: it is replicated across availability zones (writes go to all zones, reads serve from the local zone) and clients are aware of zone topology. The read path is simple — get(key) from the local zone replica — but the write path is "write to local zone, asynchronously replicate to peers". A read after a write in a different zone may briefly see stale data, but a read after a write in the same zone is consistent. The trade-off accepted is much like TAO's: zone-local consistency, eventual cross-zone, in exchange for fast reads everywhere.
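The read/write split can be sketched in a few lines. The per-zone dictionaries below stand in for memcached replicas, and the class illustrates the topology rather than the EVCache client; in particular, the cross-zone writes here are in-line where the real system replicates asynchronously:
# zone_cache_sketch.py: illustrative zone-aware cache, write everywhere, read locally
class ZoneAwareCache:
    def __init__(self, local_zone, all_zones):
        self.local_zone = local_zone
        # One dict per zone stands in for that zone's memcached replica.
        self.replicas = {zone: {} for zone in all_zones}

    def set(self, key, value):
        # Write path: the local zone first, then the other zones (asynchronous in
        # the real system, which is where the brief cross-zone staleness comes from).
        self.replicas[self.local_zone][key] = value
        for zone, replica in self.replicas.items():
            if zone != self.local_zone:
                replica[key] = value

    def get(self, key):
        # Read path: always the local-zone replica, no cross-zone hop.
        return self.replicas[self.local_zone].get(key)

# Usage: a client in us-east-1a writes and reads its own zone's replica.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
writer = ZoneAwareCache("us-east-1a", zones)
writer.set("recent:42", ["doc-123", "series-9"])
print(writer.get("recent:42"))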
FIT (Failure Injection Testing) is the in-process fault-injection framework. Where Chaos Monkey works at the infrastructure level (kill instances), FIT works at the request level: tag a single request with a "fail this dependency" header, and the client library will simulate the failure for just that request. This is what lets a team test their fallback for "user-history service down" without actually taking down the user-history service for everyone. Per-request fault injection is the missing rung on the ladder between unit tests (which can't reproduce production complexity) and Chaos Monkey (which is too blunt for fine-grained scenarios).
# fallback_simulation.py
# Simulate a degrading dependency and a circuit breaker with three fallback levels.
# Demonstrates how the fallback chain keeps user-visible success rate high even
# as the primary dependency fails.
import random, statistics, collections

NUM_REQUESTS = 50_000
PRIMARY_FAIL_RATE = 0.40   # personalisation service is degraded — 40% fail
EVCACHE_HIT_RATE = 0.85    # of fallbacks, EVCache snapshot has the data 85% of the time
GENERIC_AVAILABLE = 1.0    # generic popular-content rail is always available

random.seed(11)
counts = collections.Counter()
latencies = []

for _ in range(NUM_REQUESTS):
    if random.random() > PRIMARY_FAIL_RATE:
        counts['primary'] += 1
        latencies.append(random.gauss(35, 8))
    else:
        # primary failed — try EVCache snapshot
        if random.random() < EVCACHE_HIT_RATE:
            counts['evcache_fallback'] += 1
            latencies.append(random.gauss(6, 2))
        else:
            # EVCache miss — generic rail
            counts['generic_fallback'] += 1
            latencies.append(random.gauss(3, 1))

total = sum(counts.values())
print(f"Total requests: {total:,}")
print(f"Primary success: {counts['primary']:,} ({counts['primary']/total*100:.2f}%)")
print(f"EVCache fallback: {counts['evcache_fallback']:,} ({counts['evcache_fallback']/total*100:.2f}%)")
print(f"Generic fallback: {counts['generic_fallback']:,} ({counts['generic_fallback']/total*100:.2f}%)")
print(f"User-visible failures: 0 (every request returned something)")
print()
print(f"p50 latency: {statistics.median(latencies):.1f}ms")
print(f"p99 latency: {sorted(latencies)[int(0.99*len(latencies))]:.1f}ms")
Sample output on a PaySetu analysis box:
Total requests: 50,000
Primary success: 30,061 (60.12%)
EVCache fallback: 16,876 (33.75%)
Generic fallback: 3,063 (6.13%)
User-visible failures: 0 (every request returned something)
p50 latency: 24.6ms
p99 latency: 48.3ms
Walkthrough: even with the primary dependency failing 40% of the time — which would be a major incident in most companies — the user-visible success rate is 100% because the fallback chain has two more layers below the primary. Why the p99 improves during the incident: the EVCache and generic fallback paths are faster than the primary (they are local cache reads instead of cross-service RPCs). When the primary fails, more traffic flows through the faster paths, pulling the tail latency down. This is the counter-intuitive property of well-designed fallbacks — the system can run faster (with degraded freshness) during a partial outage than during normal operation, which is why a poorly-tuned circuit breaker can mask a real problem if you only watch latency.
The Python simulation above is a 30-line approximation of the real production behaviour, but it captures the load-bearing property: every request returns something, even when 40% of the primary dependency is broken. That property, multiplied across forty microservices in a request, is what keeps Netflix watchable when AWS has a bad day.
Decentralised ownership — the operational mirror
The third pillar is organisational. Netflix famously runs without a centralised operations team. Each engineering team owns its services end-to-end: on-call, deployments, capacity, dashboards, post-mortems. This is the "freedom and responsibility" culture in operational form. The platform team provides the building blocks (deployment tooling, telemetry, the chaos suite, the client libraries) but does not run the services on the product teams' behalf.
The consequence is that the cost of a bad design is borne by the team that built it. A team that ships a service without proper circuit breakers, without dashboards, without runbooks, will be paged at 3am for incidents the platform team won't help them debug. This creates a sharp incentive to build resilient services from day one, because the alternative is sleepless nights for the team. KapitalKite, building a stockbroker stack, would find this organisational pattern useful but uncomfortable: the on-call burden falls heavily on the team that is also writing the next feature, and there is constant tension between feature delivery and operational hardening. Netflix's resolution is to make hardening a first-class deliverable, tracked as visibly as features.
The other consequence is that operational knowledge concentrates with the team that needs it most, instead of dissipating across a separate operations group. When a service degrades, the people debugging it wrote the code last week. That tightens the feedback loop: a hard-to-debug incident this week translates into refactoring the code next week. A central operations team would, by contrast, accumulate workarounds and runbooks rather than fixing the root cause.
Common confusions
- "Chaos Monkey is the secret" — Chaos Monkey is the most-cited part of Netflix's stack, but it is the least unique. The tool is easy to build and many companies have built equivalents. What is hard to copy is the cultural commitment that no team can opt out, and the engineering practice that every dependency call has a tested fallback. Without the cultural piece, Chaos Monkey is theatre — it kills a few instances, the team complains, and the team is allowed to silence it. Netflix doesn't allow the silencing.
- "Hystrix solves resilience" — Hystrix is one circuit-breaker implementation. Netflix has since stopped active development on it in favour of newer libraries, but the pattern (timeout, thread-pool budget, fallback hook) is what matters. A team that adopts Hystrix without writing real fallbacks has a circuit breaker but no resilience — it will fail open or fall back to null and the user will see errors anyway.
- "Eventually consistent caching is wrong for streaming" — eventually consistent is exactly right for streaming. The video catalogue, recommendations, and even the licence cache can tolerate seconds of staleness without user impact. The places where strong consistency is needed (payment status, account-state changes) are a tiny fraction of the request volume, and Netflix routes those through different paths. The mistake to avoid is applying strong consistency uniformly because you don't trust your reasoning about which reads can tolerate staleness.
- "You can copy Netflix's culture by reading their tech blog" — you cannot. The tech blog describes the artefacts (libraries, tools, post-mortems) but the binding culture is hiring and promotion practices, the on-call rotation, the explicit decision that no one can opt out of chaos testing, and the willingness of leadership to defend that decision when a VP whose team got paged at 3am pushes back. None of that travels in a blog post.
- "Chaos engineering is for big companies only" — the tooling scales down. A small team can write a 100-line script that randomly kills a single VM during business hours. The cultural part also scales down — a four-person team can decide that nobody is allowed to opt out. What does not scale down is the regional failover (Chaos Kong assumes you run in multiple regions), but that is a function of the architecture, not the chaos practice.
- "Resilience and reliability are the same" — they are not. Reliability is the absence of failure; resilience is the presence of recovery. Netflix's choice was that pursuing pure reliability (99.999% uptime per service) was a losing game on a public cloud, so instead they built a system that can absorb failures and degrade gracefully. The user might see a slightly less personalised carousel during an outage, but the play button still works.
Going deeper
What FIT made possible that Chaos Monkey couldn't
Chaos Monkey is coarse — it kills an instance and you find out which downstream services were affected. FIT (Failure Injection Testing) is fine-grained: a single request is tagged "fail dependency X", and only that one request sees the failure. This makes two things possible. First, every team can run integration tests in production for their fallback paths without affecting other users. Second, before a Chaos Kong exercise, the platform team can simulate the regional failure for a small fraction of traffic via FIT and watch which services have unexpected fallbacks. The progression "FIT → Chaos Monkey → Chaos Gorilla → Chaos Kong" is the operational ladder from "we tested one path" to "we proved the whole system handles regional failure".
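A hedged sketch of that per-request mechanism: the tag name, context shape, and wrapper class are hypothetical, but the load-bearing idea is checking a request-scoped tag before the call and simulating the failure only when it matches.
# fit_sketch.py: illustrative per-request fault injection in a client wrapper
class SimulatedDependencyFailure(Exception):
    pass

class FaultInjectingClient:
    """Wraps a real client; fails only requests tagged for this dependency."""
    def __init__(self, dependency_name, real_client):
        self.dependency_name = dependency_name
        self.real_client = real_client

    def call(self, request_context, *args, **kwargs):
        # The tag travels with the request (e.g. propagated from a hypothetical
        # "fail this dependency" header into the request context).
        if request_context.get("fit_fail") == self.dependency_name:
            raise SimulatedDependencyFailure(self.dependency_name)
        return self.real_client(*args, **kwargs)

# Usage: only the tagged request sees the failure; everyone else gets the real call.
client = FaultInjectingClient("user-history", lambda uid: ["doc-123"])
print(client.call({}, 42))                            # untagged request: real data
try:
    client.call({"fit_fail": "user-history"}, 42)     # tagged request: simulated outage
except SimulatedDependencyFailure:
    print("fallback path exercised for this one request only")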
Why the data plane is in Cassandra, not RDS
Netflix runs much of its persistent state on Cassandra rather than on AWS-managed RDS. The reason is regional failover: RDS is single-region with cross-region read replicas; Cassandra is multi-region by default with tunable consistency. When Chaos Kong takes a region offline, RDS would require a full failover and re-pointing of all clients; Cassandra continues to serve from surviving regions. The trade-off is that Cassandra is harder to operate (you own the cluster, the compactions, the tombstones) but it removes a class of failures that AWS's regional architecture imposes on managed-database users. PaySetu, considering a similar bet, would weigh the operational complexity of running Cassandra against the resilience gain — for most workloads under 10M users, RDS plus careful regional design is enough.
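The per-query consistency knob is what the regional argument rests on. A hedged sketch using the open-source DataStax Python driver, with a hypothetical keyspace and table:
# local_quorum_sketch.py: illustrative LOCAL_QUORUM usage with the DataStax Python driver
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact points and schema; the point is the consistency level.
cluster = Cluster(["10.0.0.10", "10.0.0.11"])
session = cluster.connect("viewing_history")   # hypothetical keyspace

# LOCAL_QUORUM requires a quorum of replicas in the local datacenter/region only,
# so reads and writes keep succeeding even if another region is offline entirely,
# which is the property a Chaos Kong exercise tests.
write = SimpleStatement(
    "INSERT INTO recently_watched (user_id, title_id, watched_at) "
    "VALUES (%s, %s, toTimestamp(now()))",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, (42, "doc-123"))

read = SimpleStatement(
    "SELECT title_id FROM recently_watched WHERE user_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
rows = session.execute(read, (42,))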
The cost of the resilience tax
Every fallback, every circuit breaker, every chaos exercise is engineering effort that doesn't ship features. Netflix has openly said that the resilience work is roughly 20–30% of senior engineer time. The pay-off is that streaming continues to work during incidents that would take other companies offline — but the cost is that feature velocity is lower than it would be without the tax. A startup at 100k users would be foolish to pay this tax: their incidents are infrequent, their engineering hours are precious, and their users will tolerate the occasional outage. The decision to pay the resilience tax should track the user count and the cost of an outage, not the prestige of mimicking Netflix.
What changed when Netflix moved to AWS
The early Netflix DVD-mailing infrastructure ran on traditional datacentres with a small operations team and a relatively brittle stack. The streaming pivot — and the move to AWS that came with it — is what forced the resilience culture. AWS itself is full of fault domains: instances die, AZs degrade, regions have bad days. A team running on AWS at Netflix's scale will hit those failures every week. The choice was either to build the resilience or to be down constantly. The cultural insight is that resilience is not a feature you add later — it has to be the way every service is built from the start, which means the standard library has to enforce it, which means the platform team had to build Hystrix-and-friends as the path of least resistance for new services.
Where this leads next
The thread from this chapter ties to the next case studies in Part 20:
- Discord: BEAM to Rust journey — Discord's resilience came from the Erlang/OTP runtime's process-level isolation, which is a different starting point from Netflix's library-level isolation. Reading them back-to-back shows two paths to the same outcome.
- Cloudflare: anycast and global load balancing — Cloudflare's resilience model is about the network layer, not the service layer. Where Netflix builds resilience inside each service, Cloudflare builds it into the routing fabric.
- Twitter: timeline fanout and rate limiting — Twitter's incident history shaped a different culture, more focused on capacity planning and rate-limiting than on chaos injection.
The lasting lesson from Netflix is that resilience is an organisational property as much as a technical one. The libraries and tools are widely copied; the culture that makes them work — no opt-outs, team owns the page, fallback is a first-class deliverable — is harder to import. A company that copies the tools without the culture ends up with a chaos tool nobody runs and circuit breakers nobody tests.
References
- Cory Bennett and Ariel Tseitlin, "Chaos Monkey released into the wild" (Netflix Tech Blog, 2012) — the original announcement and design notes.
- Ben Christensen, "Hystrix: Latency and Fault Tolerance for Distributed Systems" (Netflix Tech Blog) — the canonical write-up of the circuit breaker library.
- Ali Basiri et al., "Chaos Engineering" (IEEE Software, 2016) — the academic framing of the practice that grew out of Netflix's experience.
- Casey Rosenthal and Nora Jones, "Chaos Engineering: System Resiliency in Practice" (O'Reilly, 2020) — the book-length treatment, with chapters by Netflix engineers.
- Bruce Wong and Christos Kalantzis, "EVCache: Building Distributed In-Memory Caching at Netflix" (Netflix Tech Blog) — the cache layer design.
- Naresh Gopalani, "FIT: Failure Injection Testing" (Netflix Tech Blog) — the per-request fault injection framework.
- See also: Amazon: cells and shuffle-sharding, Meta: scaling the social graph, to trust the system you must break it, observability is a data problem.