Shadow traffic

Karan ships a new ranking model for Flipkart search. The staging benchmark — 50K synthetic queries from a JMeter script — runs at 14 ms p99 and 42% CPU per replica. Confidence is high. The model is rolled out to 5% of live traffic on a Tuesday afternoon. Within 90 seconds the new replicas are at 96% CPU, p99 has climbed to 380 ms, the autoscaler is panic-spawning pods that take 110 seconds to warm up, and the on-call engineer rolls back. The post-mortem finds the cause in 20 minutes — real Flipkart queries contain 14× more Hindi/Tamil/Bengali tokens than the synthetic load, and the new model's tokenizer falls off the fast path it had for ASCII inputs. The synthetic benchmark could never have caught this; only real production query strings could. Shadow traffic is the discipline that lets you discover this before the rollback is needed instead of after.

Shadow traffic mirrors a copy of every live production request (or a sampled fraction of them) to a parallel version of the service, which processes the request but whose response is discarded. Users continue to be served by the production version, while the new version sees real input shapes, real distributions, and real upstream conditions. It catches the class of bugs that staging benchmarks structurally cannot: query-shape sensitivity, real-traffic cache locality, real upstream latency interactions, and resource-consumption surprises that only appear when the input is what production actually sends. The cost is roughly one extra copy of the service plus a mirroring layer; the value is rolling forward to the next version without the rollback, the on-call page, and the customer-visible SLO breach.

Why staging benchmarks structurally cannot replace shadow traffic

A staging benchmark runs against synthetic load — usually JMeter, k6, Locust, or a recorded-and-replayed slice of production. Every one of these falls short of real production traffic in a specific structural way that determines which class of bug it can find and which class it cannot.

Synthetic generators produce the wrong distribution. A k6 script that fires 1000 RPS of GET /search?q=phone measures the system's response to one query, repeated. Real Flipkart search receives queries in 27 Indian languages, with median length 14 characters and a long tail at 220 characters, with 0.3% of queries triggering the spell-correction path that adds 80 ms p99 per match. The synthetic generator catches none of this — its CPU profile is stable, its cache hit rate is 99.9% (one query, one cache entry), its branch-prediction is perfect (one code path, one branch outcome). The new code can pass every synthetic SLO and still blow up on the real query distribution because it was tested against a workload with 1% of real production's structural variety.

Recorded-and-replayed traffic is stale. A recorded HAR file from last Wednesday cannot reproduce today's IPL-final query patterns, today's promo-banner click distribution, today's broken third-party plugin sending malformed Accept-Language headers, or today's Aadhaar-linked authentication flow that just shipped. Real production traffic is non-stationary — the distribution changes hourly with user behaviour, daily with marketing campaigns, weekly with feature rollouts upstream. Replay traffic captures one frozen distribution; shadow traffic captures the live one.

Staging upstream dependencies are simplified. Production hits the real bank-rail downstream with its real 200–800 ms p99, the real Aadhaar auth gateway with its real burst behaviour, the real Redis cluster with its real eviction pressure. Staging hits stubs, mocks, or scaled-down replicas — often missing the queueing dynamics that real upstream produces. A new code path that adds one extra Redis call looks free in staging (1 ms latency, no contention) and catastrophic in production (20 ms p99 because real Redis is at 60% memory pressure with high tail latency from eviction).

Figure: three ways to test a new version and what each one catches. Three approaches compared side by side: synthetic load via JMeter or k6, recorded replay of yesterday's production traffic, and shadow traffic mirrored from live production.

Synthetic load (k6, JMeter): catches gross perf regressions, thread-pool sizing, and the saturation point; misses query-shape sensitivity, real cache locality, real upstream tail, malformed inputs, and non-stationary load. Cost: low. Coverage: ~30%.

Recorded replay (HAR/PCAP): catches query-shape coverage, malformed inputs (if recorded), and header diversity; misses today's traffic shape, live upstream behaviour, new feature interactions, and session continuity. Cost: medium. Coverage: ~60%.

Shadow traffic (live mirror): catches everything synthetic load and replay catch, plus live distribution shifts and real upstream interactions; misses write side-effects (by design) and multi-step user sessions. Cost: one extra copy of capacity. Coverage: ~95%.
Synthetic load is cheap but covers only a sliver of real bug classes. Recorded replay covers more but is stale. Shadow traffic is the only approach that exercises the new version against today's actual traffic distribution and today's actual upstream behaviour. Illustrative percentages — exact coverage varies by service shape.

Why this gap is structural and not just a "we should write better synthetic tests" problem: the production query distribution is generated by hundreds of millions of independent users, each with private state (their search history, their device language, their network conditions, their installed plugins). The Kolmogorov complexity of that distribution is enormous — far larger than any synthetic generator can encode. The only data source rich enough to test against the production distribution is the production distribution itself. Shadow traffic is not "better synthetic load"; it is a different category of test, the only one that operates on the actual input distribution rather than a model of it.

A runnable shadow-traffic mirror with comparison

The simplest shadow-traffic setup uses an HTTP layer that, for each incoming request, sends the request to both the production version and the candidate version, returns the production response to the user, and records both responses' status, headers, and timing for offline comparison. Below is a real, runnable Python implementation using aiohttp and asyncio — it sits in front of two backends, mirrors traffic, and writes a per-request comparison record to disk.

# shadow_mirror.py — minimal production-grade traffic mirror with comparison
# Forwards every request to PROD; mirrors a sampled fraction to SHADOW; user
# always gets the PROD response. Writes per-request comparison rows for analysis.
import asyncio, json, time, random, hashlib, os, sys
from aiohttp import web, ClientSession, ClientTimeout

# Backend URLs can be overridden via environment so the same script runs
# locally (see "Reproduce this on your laptop" below).
PROD_URL    = os.environ.get("PROD_URL",   "http://flipkart-search-prod.svc:8080")
SHADOW_URL  = os.environ.get("SHADOW_URL", "http://flipkart-search-canary.svc:8080")
SHADOW_RATE = 0.10                     # mirror 10% of traffic
SHADOW_TIMEOUT = ClientTimeout(total=2.0)   # don't let shadow slow the user
COMPARE_LOG = open("/var/log/shadow-compare.jsonl", "a", buffering=1)  # line-buffered
SHADOW_BUDGET_RPS = 800                # circuit-break if shadow falls behind
shadow_inflight = 0                    # in-flight count as a crude RPS proxy; single-process

async def call_backend(session, base_url, request, *, timeout=None):
    t0 = time.perf_counter_ns()
    body = await request.read() if request.can_read_body else None
    try:
        async with session.request(
            request.method, base_url + request.path_qs,
            headers={k: v for k, v in request.headers.items()
                     if k.lower() not in ("host", "content-length")},
            data=body, timeout=timeout,
        ) as resp:
            resp_body = await resp.read()
            return {"status": resp.status, "body": resp_body,
                    "headers": dict(resp.headers),
                    "ms": (time.perf_counter_ns() - t0) / 1e6}
    except asyncio.TimeoutError:
        return {"status": 0, "body": b"", "headers": {}, "ms": -1, "err": "timeout"}
    except Exception as e:
        return {"status": 0, "body": b"", "headers": {}, "ms": -1, "err": str(e)}

async def shadow_call_and_record(session, request, prod_result, req_id):
    global shadow_inflight
    if shadow_inflight > SHADOW_BUDGET_RPS:
        return  # circuit broken — shadow is overloaded, skip this one
    shadow_inflight += 1
    try:
        shadow_result = await call_backend(session, SHADOW_URL, request,
                                           timeout=SHADOW_TIMEOUT)
        prod_hash = hashlib.sha256(prod_result["body"]).hexdigest()[:16]
        shadow_hash = hashlib.sha256(shadow_result["body"]).hexdigest()[:16]
        record = {
            "id": req_id, "ts": time.time(),
            "path": request.path, "method": request.method,
            "prod_status": prod_result["status"], "shadow_status": shadow_result["status"],
            "prod_ms": prod_result["ms"], "shadow_ms": shadow_result["ms"],
            "prod_hash": prod_hash, "shadow_hash": shadow_hash,
            "match": prod_hash == shadow_hash,
            "shadow_err": shadow_result.get("err"),
        }
        COMPARE_LOG.write(json.dumps(record) + "\n")
    finally:
        shadow_inflight -= 1

async def handler(request):
    req_id = request.headers.get("x-request-id") or hashlib.sha256(
        f"{time.time_ns()}{random.random()}".encode()).hexdigest()[:12]
    session = request.app["session"]
    prod_result = await call_backend(session, PROD_URL, request)
    if random.random() < SHADOW_RATE:
        asyncio.create_task(shadow_call_and_record(session, request, prod_result, req_id))
    return web.Response(status=prod_result["status"], body=prod_result["body"],
                        headers={k: v for k, v in prod_result["headers"].items()
                                 if k.lower() not in ("content-length", "transfer-encoding")})

async def init_app():
    app = web.Application()
    app["session"] = ClientSession()
    app.router.add_route("*", "/{tail:.*}", handler)
    return app

if __name__ == "__main__":
    web.run_app(init_app(), port=int(sys.argv[1]) if len(sys.argv) > 1 else 8000)

Sample run, mirroring 10% of 4500 RPS Flipkart-search-shaped traffic for 30 seconds, producing the comparison log:

$ python3 shadow_mirror.py 8000 &
$ wrk2 -R4500 -d30s -t8 -c200 'http://localhost:8000/search?q=mobile'

Running 30s test @ http://localhost:8000/search?q=mobile
  8 threads and 200 connections
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   28.10ms
   75.000%   42.20ms
   90.000%   58.40ms
   99.000%  118.20ms
   99.900%  198.40ms
  134218 requests in 30.00s, 47.20MB read
Requests/sec:  4473.93

$ wc -l /var/log/shadow-compare.jsonl
13412 /var/log/shadow-compare.jsonl

$ jq -r 'select(.match==false) | "\(.path) prod=\(.prod_ms)ms shadow=\(.shadow_ms)ms"' \
       /var/log/shadow-compare.jsonl | head
/search?q=mobile prod=24.2ms shadow=78.4ms
/search?q=पुस्तक prod=31.8ms shadow=412.0ms
/search?q=tamil+nadu+saree prod=29.4ms shadow=288.1ms

$ jq -s 'map(select(.shadow_ms >= 0).shadow_ms) | add / length' /var/log/shadow-compare.jsonl
124.32
$ jq -s 'map(select(.prod_ms >= 0).prod_ms) | add / length' /var/log/shadow-compare.jsonl
31.18

Walking the key lines. SHADOW_RATE = 0.10 is the sampling rate — start at 1% on day one, climb to 10% as confidence grows, and never use 100% unless the shadow capacity matches production capacity. Sampling at 10% still exposes the shadow to every bug class, just at one-tenth the rate — enough to surface bugs statistically while keeping the shadow infrastructure cost at one-tenth of production.

asyncio.create_task(shadow_call_and_record(...)) is the load-bearing line — the shadow call is fired and forgotten as a separate task, so the user's response is never blocked on the shadow path. If you await the shadow call instead, the user latency now includes the shadow latency, and a slow shadow takes down production.

SHADOW_BUDGET_RPS = 800 is a circuit breaker — if the shadow in-flight count exceeds the budget (because the candidate is slow or down), subsequent shadows are dropped on the floor instead of queueing unboundedly in the mirror. Production must continue normally even when shadow is broken; this line is the contract.

shadow_hash = hashlib.sha256(shadow_result["body"]).hexdigest()[:16] computes a content fingerprint so the offline comparison can detect functional differences (the candidate ranker returned different results) separately from latency differences. Most production shadow setups use semantic comparison (parse the response, compare structured fields) rather than byte-equality — a re-ordered result list with the same items is a match for ranking but a mismatch for byte-equality.
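A minimal sketch of such a comparator for a ranking service, assuming both backends return JSON of the shape {"results": [{"id": ...}, ...]}; that contract is illustrative, not something the mirror above guarantees:

# semantic_compare.py, sketch of a ranking-aware comparator to replace the
# byte-equality sha256 check. The {"results": [{"id": ...}]} response shape
# is an illustrative assumption about the search service's contract.
import json

def top_n_overlap(prod_body: bytes, shadow_body: bytes, n: int = 10) -> float:
    """Fraction of prod's top-n result IDs that also appear in shadow's top-n."""
    prod_ids = [r["id"] for r in json.loads(prod_body)["results"][:n]]
    shadow_ids = {r["id"] for r in json.loads(shadow_body)["results"][:n]}
    if not prod_ids:
        return 1.0 if not shadow_ids else 0.0   # both empty counts as a match
    return sum(1 for i in prod_ids if i in shadow_ids) / len(prod_ids)

def is_match(prod_body: bytes, shadow_body: bytes, threshold: float = 0.8) -> bool:
    try:
        return top_n_overlap(prod_body, shadow_body) >= threshold
    except (json.JSONDecodeError, KeyError, TypeError):
        return False   # unparseable or shape-mismatched responses never match

Running it requires logging response bodies (or structured extracts of them) rather than just hashes, which is the usual production choice.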

Why the comparison logic must be offline rather than inline: an inline diff that runs in the request path adds latency to every request and creates a new failure mode (what happens when the comparator throws?). The mirror's job is to capture both responses; the analyst's job (a separate batch process reading the JSONL log) is to compute matches, latency distributions, and resource-consumption deltas. Production-grade systems pipe the comparison log into a streaming aggregator (Kafka → Flink → ClickHouse, or BigQuery scheduled queries) and produce a dashboard showing match rate, latency-distribution KS-statistic, and per-feature-flag breakdowns over time. The mirror is a tap; the analysis is a separate pipeline.

Write-side-effect handling — the hardest part of shadow traffic

Read-only requests are easy to shadow. A search query, a product detail fetch, a recommendation lookup — replaying these against a candidate version produces no side effect outside the candidate's own infrastructure. Write requests are where shadow traffic gets dangerous and where most teams either skip them entirely (losing 30–60% of the traffic distribution) or get bitten in production.

The core problem. A shadow POST /payments/initiate to the candidate backend will, by default, talk to the real bank-rail downstream and create a real ₹4500 charge against a real customer's card. Doing this once is a P0 incident. Doing this at 10% of production write rate is a P0 incident every 30 seconds. The shadow infrastructure must therefore route the candidate's write attempts to stub backends, isolated test infrastructure, or idempotency-aware replay logic — and getting that routing right is the hard problem of shadow design.

The four production-validated patterns:

Pattern 1: read-only mode. The candidate backend runs with a feature flag that disables every write path — mutations return early with a synthetic success response. Loses the ability to test write-path performance, but completely safe. Used by Hotstar for catalogue-service shadows because the catalogue service is 92% reads anyway.

Pattern 2: stubbed downstreams. The candidate's HTTP client is configured to point at stub services for every external call (banks, Aadhaar, SMS gateways, third-party APIs). The candidate processes the full write path, generates the database row, and does call the stub — which always returns a configured fake success response. Captures the candidate's write-path performance and resource consumption, but does not exercise the real downstream's real latency. Used by Razorpay for payment-API shadows; the bank-rail stubs are calibrated to return latencies sampled from the previous day's real bank-rail latency distribution, which is the closest you can get to real upstream behaviour without making real charges.
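A sketch of what a latency-calibrated stub can look like, assuming the previous day's downstream latencies have been exported to a one-number-per-line file; the file path, port, and response shape are all illustrative:

# bank_rail_stub.py, sketch of a Pattern-2 stub: a fake downstream that
# replays an empirical latency distribution. The latency file is assumed to
# be exported nightly from production metrics; names here are illustrative.
import asyncio, random
from aiohttp import web

with open("/var/lib/stub/bankrail-latencies-ms.txt") as f:
    LATENCIES_MS = [float(line) for line in f if line.strip()]

async def initiate_payment(request):
    # Sample a latency from yesterday's real distribution, then return a
    # configured fake success; no real charge is ever created.
    await asyncio.sleep(random.choice(LATENCIES_MS) / 1000)
    return web.json_response({"status": "AUTHORIZED", "stub": True})

app = web.Application()
app.router.add_post("/payments/initiate", initiate_payment)

if __name__ == "__main__":
    web.run_app(app, port=9090)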

Pattern 3: write to a shadow database. The candidate has its own write-side state — a separate PostgreSQL instance, a separate Kafka topic, a separate Redis. Mutations land in the shadow's state, not production's. The candidate's reads come from the shadow's state, so the data shape is consistent. Captures full read-write performance against realistic data shapes. Storage cost roughly doubles, which is the price of getting full coverage. Used by Flipkart for the order-service shadow during big rollouts.

Pattern 4: idempotent dry-run with verification. The candidate executes the write logic but in a dry-run mode that computes the would-be result without persisting it. Compares the would-be result to the production write's actual result. Catches semantic bugs (the candidate would have written a different value) at the cost of requiring every write path to support dry-run mode — a non-trivial code investment. Used by Zerodha for order-matching engine shadows because the cost of a wrong order write is unbounded.
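One shape the dry-run mode can take, sketched with illustrative names and types; the property that matters is that the full business logic runs in both modes and only the final persist step is gated:

# Sketch of Pattern 4. Names are illustrative; the real investment is
# threading a dry_run flag through every write path without forking logic.
from dataclasses import dataclass

@dataclass
class OrderWrite:
    order_id: str
    price: int
    quantity: int

def compute_match_price(req: dict) -> int:
    return req["limit_price"]            # placeholder for real matching logic

def place_order(req: dict, db, *, dry_run: bool) -> OrderWrite:
    result = OrderWrite(
        order_id=req["order_id"],
        price=compute_match_price(req),  # full business logic runs either way
        quantity=req["quantity"],
    )
    if not dry_run:
        db.persist(result)               # only the production path persists
    return result                        # shadow diffs this against prod's write

The shadow harness calls place_order(..., dry_run=True) and compares the returned OrderWrite field by field against the row production actually wrote.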

These four patterns are not exclusive — a single service often uses Pattern 1 for the easy 80% of writes (ones that can be safely no-oped), Pattern 2 for the bulk of revenue-affecting writes that need real performance numbers, Pattern 3 for a small subset of stateful workflows the team is rewriting end-to-end, and Pattern 4 for the handful of high-stakes paths where semantic correctness is non-negotiable. The shadow's launch-readiness checklist explicitly enumerates which pattern each write path uses, with the rationale recorded in the design doc.

A common mistake: assuming the choice can be deferred to "after we get shadow working". The choice is the design. A team that wires up an Envoy mirror without first mapping every write path to one of the four patterns will, on the first day shadow runs against real traffic, discover that the candidate has just made 4000 real bank-rail charges and ₹18 lakh of customer money is now in incorrect transaction states. Map the patterns first, build the mirror second.

Figure: four production-validated patterns for handling write side-effects in shadow traffic. In each, the production path hits the real downstream while the shadow path is routed somewhere safe.

Pattern 1 (read-only mode): the candidate's write paths early-return synthetic success; no downstream call at all. Safest, narrowest coverage. Used by Hotstar's catalogue service (92% reads anyway).

Pattern 2 (stubbed downstreams): external calls are routed to stubs; the bank-rail stub mimics the real latency distribution. Tests the candidate's write-path performance against a fake downstream. Used by Razorpay's payment API.

Pattern 3 (shadow database): a separate PG/Kafka/Redis; the candidate has its own state, and storage roughly doubles (₹2L/mo extra for Flipkart). Full read-write performance against realistic data shapes. Used by Flipkart's order service during big rollouts.

Pattern 4 (idempotent dry-run + verify): the candidate computes the would-be result and compares it to production's actual result. Catches semantic bugs; requires dry-run mode in every write path. Used by Zerodha's order-matching engine, where a wrong write is an unbounded loss.
Read-only is the cheapest and narrowest; idempotent dry-run is the most expensive and most rigorous. Pick based on the cost of a wrong write — for catalogue browsing the read-only mode is fine; for order matching only Pattern 4 is acceptable. Most teams use Pattern 2 for the bulk of shadows and Pattern 4 only for the highest-stakes paths.

Why "just point shadow at staging downstreams" is not enough: staging downstreams have different latency distributions, different connection limits, different retry behaviour, and different data shapes than production downstreams. A shadow that reads from staging Postgres sees rows that look nothing like production rows; the candidate's query plans, cache hit rates, and lock-acquisition latencies are all measured against a fictional workload. The point of shadow traffic is to test the candidate against production conditions; pointing at staging downstreams replaces production conditions with staging conditions and defeats the purpose. Either route writes to stubs that mimic production downstream behaviour (Pattern 2), or build out shadow-side state that mirrors production (Pattern 3). Staging downstreams are not a third option.

Reading the comparison output — what to look for

The shadow log captures three signals per request: did the candidate succeed, was the candidate's response equivalent, and what did the candidate's latency and resource consumption look like. The analysis pipeline that turns these signals into a launch/no-launch decision is where most shadow programs underinvest.

The minimum viable comparison report tracks four metrics over a rolling window:

Match rate. What fraction of comparable requests produced an equivalent candidate response? "Equivalent" depends on the service — for a search ranker it might be top-10 overlap > 80%, for a payment validator it might be exact equality on the decision field. A match rate below 99% on what should be a behaviour-preserving change is a launch blocker. A 95% match rate on a deliberately-different change (the new ranker is supposed to return different results) is fine, but the 5% mismatches need a sample set audited by a domain expert before launch.

Latency-distribution KS statistic. Apply the two-sample Kolmogorov-Smirnov test to the prod-latency and shadow-latency distributions. KS > 0.05 indicates the distributions differ meaningfully. Any KS over 0.10 is a launch blocker because it means the candidate has changed the latency distribution in a way that will be visible in production p99s. The KS statistic is more robust than comparing means — it catches the case where the new version is faster on the body of the distribution but slower on the tail.
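A sketch of that computation over the mirror's JSONL log, using scipy's two-sample KS test (scipy is the only non-stdlib dependency; the log path and field names match the mirror above):

# compare_report.py, offline sketch: match rate plus two-sample KS statistic
# over the shadow-compare.jsonl the mirror writes. Requires scipy.
import json
from scipy.stats import ks_2samp

prod_ms, shadow_ms, matches, total = [], [], 0, 0
with open("/var/log/shadow-compare.jsonl") as f:
    for line in f:
        r = json.loads(line)
        total += 1
        if r["shadow_ms"] < 0:      # shadow timeout/error: counts as a
            continue                # non-match, contributes no latency sample
        prod_ms.append(r["prod_ms"])
        shadow_ms.append(r["shadow_ms"])
        matches += bool(r["match"])

ks = ks_2samp(prod_ms, shadow_ms)
print(f"match rate:   {matches / total:.4f} over {total} requests")
print(f"KS statistic: {ks.statistic:.4f} (p = {ks.pvalue:.3g})")
print("LAUNCH BLOCKER" if ks.statistic > 0.10 else "within KS budget")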

Per-feature-flag breakdowns. If the candidate uses feature flags to gate behaviour, the comparison report should break down match rate and latency delta per flag combination. A candidate that matches 99.8% overall but matches only 94% when feature flag X is on has a bug in the X path. Without per-flag breakdowns, the bug averages out into the aggregate and ships. Razorpay's launch-readiness report breaks down comparison metrics across 28 feature dimensions; Flipkart's breaks down across 14 plus the user's app-version cohort.

Resource-consumption delta. From the candidate replicas' Prometheus metrics, compare CPU, memory, allocations/sec, GC pause, file descriptor count, and connection-pool depth between candidate and production replicas. A candidate that matches behaviour and matches latency but consumes 30% more CPU per request will require 30% more capacity at launch — and if the autoscaler isn't pre-sized, the launch will saturate. The CPU delta should be an explicit launch-readiness metric.
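A sketch of pulling that delta from Prometheus's HTTP query API: the metric is the standard cAdvisor one, but the Prometheus URL and pod-name regexes are assumptions about your cluster, and since the candidate sees only the sampled fraction of traffic, the honest comparison is CPU per request rather than raw CPU per replica:

# cpu_delta.py, sketch: CPU per replica for prod vs candidate via the
# Prometheus HTTP API. URL and pod regexes are illustrative assumptions.
import requests

PROM = "http://prometheus.monitoring.svc:9090/api/v1/query"

def cpu_per_replica(pod_regex: str) -> float:
    q = (f'sum(rate(container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}[5m]))'
         f' / count(count by (pod) (container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}))')
    data = requests.get(PROM, params={"query": q}, timeout=10).json()
    return float(data["data"]["result"][0]["value"][1])

prod = cpu_per_replica("search-prod-.*")
cand = cpu_per_replica("search-canary-.*")
# The candidate receives only SHADOW_RATE of the traffic, so normalise each
# side by its request rate in a real report; this prints the raw delta.
print(f"candidate CPU per replica: {cand / prod - 1:+.1%} vs prod")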

The full launch decision adds two more questions on top: was the shadow run long enough to cover today's full traffic shape (typically 24+ hours to capture all timezones and the full lunch-hour / IPL-prime-time / midnight cycle), and did the shadow infrastructure itself hold up (no shadow circuit-breaker trips, no shadow OOM kills, no shadow log-disk-full)? A shadow that ran for 30 minutes during a quiet Tuesday afternoon validates the candidate against 0.4% of the weekly traffic distribution — not enough to ship.

The institutional discipline that ships well: a shadow that has run for 7 days, with match rate within 0.2% of expected, with KS statistic below 0.05, with resource-consumption delta within 10%, with no shadow-side incidents, against the full weekly traffic cycle including the peak window, with the comparison report reviewed by a senior engineer who signs off in the launch ticket. The discipline that ships badly: a shadow that ran for an afternoon, looked OK on aggregate metrics, and got rolled out on a Friday before a long weekend.

A subtle failure mode: aggregate match rate hides per-segment regressions. A 99.6% overall match rate on Flipkart search sounds great until you slice by query language and discover the candidate matches 99.95% on English queries and 78% on Hindi queries — the aggregate is dominated by the English bucket because 92% of queries are English. Always slice the comparison metrics by the dimensions that segment your real user base: language, app version, device tier, region, customer plan. The bug that the aggregate hides is the bug that becomes "why did p99 spike for our Tier-3 city users after the rollout" two weeks later.
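In the terms of the mirror's log, and assuming each record is extended with a lang field derived from the query (not part of the mirror above), the slice is one jq expression:

$ jq -s 'group_by(.lang) | map({lang: .[0].lang, n: length,
         match_rate: (map(select(.match)) | length) / length})' \
      /var/log/shadow-compare.jsonl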

The shadow's value compounds as you accumulate runs across releases. The fifth time a team ships a search ranker, the institutional knowledge of "what mismatch rate is normal", "what KS statistic is normal", "which feature flags historically produce the most shadow noise" turns the shadow report from a binary go/no-go into a calibrated risk score. Teams that throw away the shadow comparison data after each release re-learn this calibration each time; teams that store it in a queryable form and reference prior baselines ship faster and with more confidence. Treat the comparison output as a long-lived dataset, not a one-shot verdict.

Common confusions

  • "Shadow traffic is the same as a canary deployment." No — a canary serves real users from the new version (5% of users see the candidate's responses), so a bug in the candidate hurts those 5%. A shadow serves no users from the candidate; users always see the production version, and the candidate's responses are discarded after comparison. A canary is a graduated rollout; a shadow is a measurement tool. Most teams use both — shadow first to validate behaviour and capacity, then canary to validate the user-visible blast radius.
  • "Shadow traffic catches every bug a real rollout would catch." No — shadow does not exercise multi-step user sessions (the candidate sees individual requests in isolation, not the full session state), does not exercise write side-effects against real data (by necessity), and does not exercise client-side behaviour (the user's browser still talks to the production version, so the new API contract isn't actually tested end-to-end). Shadow is the broadest test below "real rollout", but it is not equivalent to one.
  • "100% mirroring is the right default." No — 100% mirroring requires the shadow to have full production capacity, which doubles infrastructure cost. Most teams start at 1% and climb based on what bug class they're hunting. 1% catches statistical bugs in 1/100 of the time; 100% catches them in real-time. The right rate depends on the deadline pressure and the budget — there is no universal default.
  • "You can shadow before the new version is even built — just record the traffic." That is recorded replay, not shadow. Recorded replay loses the live-distribution and live-upstream properties that make shadow valuable (see §1). Both have their place; conflating them produces the wrong investment in tooling — replay tooling is much cheaper to build than shadow tooling, and teams that think they're equivalent end up with replay and the corresponding bug-class blind spots.
  • "The shadow comparison must be byte-equal." No — the comparison should be semantically equal, with the semantics defined by the service. For ranking, it's top-N overlap; for payments, it's the decision field; for search, it's the result-ID set ignoring order; for personalisation, it's the cohort assignment. Byte-equality fails on any change that touches response field ordering, timestamps, or randomised ID generation, producing thousands of false-positive mismatches that drown the real ones.
  • "Shadow traffic is risk-free because users never see the candidate." No — a misconfigured shadow can still cause real damage. A shadow that bypasses the stub config and hits real downstream creates real charges; a shadow whose comparison logic logs full request bodies leaks PII into JSONL files that aren't access-controlled; a shadow whose mirror layer has a bug can drop or corrupt the production request before forwarding. The shadow infrastructure itself is production code and needs production review.

Going deeper

Service-mesh native mirroring — Envoy and Istio

For Kubernetes-deployed services, building the mirror layer in application code (as in the shadow_mirror.py example) is the second-best option. The first-best is to use the service mesh, which already sits on the data path and already handles the connection lifecycle. Envoy has native request_mirror_policies in its route config — set cluster: shadow-cluster, runtime_fraction: 10% and Envoy mirrors 10% of requests to the shadow cluster with the production response unblocked. Istio exposes the same capability via VirtualService.spec.http[].mirror and mirrorPercentage. The mesh-native mirror is faster (no extra hop into application code), more reliable (the mesh has been hardened against the failure modes), and more observable (mesh telemetry already captures the comparison signals you need). The trade-off is configuration complexity — getting the right traffic fraction, the right header propagation, and the right cluster scoping requires Envoy/Istio expertise. Razorpay moved from in-app mirroring to Istio mirror in 2024 and reduced shadow-related on-call pages by 70%.
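The Istio form, sketched against the service names used earlier in this chapter (the field names are Istio's VirtualService API; the hosts and percentage are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: flipkart-search
spec:
  hosts:
  - flipkart-search-prod.svc
  http:
  - route:
    - destination:
        host: flipkart-search-prod.svc    # users are always served from prod
    mirror:
      host: flipkart-search-canary.svc    # candidate gets a fire-and-forget copy
    mirrorPercentage:
      value: 10.0                         # mirror 10% of requests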

Diffy — Twitter's shadow-comparison engine

Twitter open-sourced Diffy in 2015 — a shadow-comparison server that proxies traffic to three backends (production primary, production secondary, candidate) and uses the production-secondary-vs-production-primary noise as the baseline against which it measures candidate-vs-production differences. The triple-backend design eliminates the false positives that come from non-deterministic responses (timestamps, random IDs, ordering), because anything that varies between two runs of the production code is treated as noise rather than as a candidate-vs-production difference. Diffy is the reference implementation of "compare semantically with noise correction" and worth reading even if you don't deploy it. The key insight — the right baseline is not "production should be deterministic"; the right baseline is "candidate should differ from production no more than production differs from itself" — generalises to every comparison test.
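The noise-correction idea fits in a few lines. A sketch over flat dict responses (real Diffy recurses into nested structures and aggregates noise statistics across many requests):

# Sketch of Diffy-style noise correction. primary and secondary run the same
# production code; any field that differs between them is treated as noise
# and excluded from the candidate-vs-production verdict.
def real_differences(primary: dict, secondary: dict, candidate: dict) -> list:
    noisy = {k for k in set(primary) | set(secondary)
             if primary.get(k) != secondary.get(k)}
    return sorted(k for k in set(primary) | set(candidate)
                  if k not in noisy and primary.get(k) != candidate.get(k))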

Shadow traffic for stateful systems — the Kafka mirror pattern

Stateful systems (databases, message queues, stream processors) need a different mirror primitive. For Kafka topics, MirrorMaker 2 copies messages from a production cluster to a shadow cluster with optional filtering and transformation; the shadow consumer (the candidate stream-processing job) reads from the shadow cluster and processes the same stream the production consumer is processing. The key complexity is consumer-offset management — the shadow consumer must not commit offsets to the production cluster, or the production consumer will think those offsets have been processed and skip re-reading them on restart. The shadow consumer commits offsets to the shadow cluster's own offset topic, treating the shadow cluster as a fully independent system. This pattern lets a candidate Flink job process the full production event stream alongside the production Flink job, with the comparison done downstream by writing both jobs' outputs to ClickHouse and querying the diff.
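The operative detail in code, sketched with the kafka-python client; the cluster address, topic, and group id are illustrative, and note that MirrorMaker 2's default replication policy prefixes copied topics with the source-cluster alias:

# Sketch: the shadow consumer is wired exclusively to the shadow cluster, so
# its offset commits land in the shadow cluster's __consumer_offsets and can
# never advance the production consumer group's position.
from kafka import KafkaConsumer  # kafka-python

def process_candidate(msg):
    ...  # the candidate job's processing logic (illustrative placeholder)

consumer = KafkaConsumer(
    "prod.orders",                                # MM2-prefixed copy of "orders"
    bootstrap_servers=["shadow-kafka.svc:9092"],  # shadow cluster only, never prod
    group_id="order-candidate-shadow",            # distinct from the prod group id
    enable_auto_commit=True,                      # commits go to the shadow cluster
    auto_offset_reset="earliest",
)
for msg in consumer:
    process_candidate(msg)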

Header propagation, request IDs, and the trace-id contract

Every mirrored request must carry a request-ID header that is identical between the production-bound copy and the shadow-bound copy — without it, the offline comparator cannot pair the two responses and the entire analysis falls apart. The x-request-id header (or x-correlation-id, depending on your platform's convention) is generated at the edge once and propagated to both backends. The shadow infrastructure must also strip or transform any header that would cause the candidate to behave differently — Authorization headers might need to be re-signed against the candidate's auth-service, Cookie jars might need to be rewritten if the candidate runs in a different domain, Host headers must be set to the candidate's expected vhost. Razorpay's shadow design has 14 header-transformation rules in its mirror config; getting these wrong produces silent comparison mismatches that look like candidate bugs but are actually mirror bugs. Treat the header-transformation table as part of the shadow's launch-readiness review, not as plumbing.
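A sketch of the transformation table in the mirror's own terms; the rules here are illustrative, and a real table (like Razorpay's 14 rules) also covers re-signing Authorization and rewriting cookies:

# Illustrative header-transformation rules, applied to the shadow-bound copy
# only; the prod-bound copy is untouched. x-request-id is deliberately in
# neither list, so it reaches both backends identically for response pairing.
SHADOW_HEADER_DROP = {"content-length", "cookie"}    # recomputed or rewritten
SHADOW_HEADER_SET  = {
    "host": "flipkart-search-canary.svc",            # candidate's expected vhost
    "x-shadow": "1",                                 # segregate shadow-side logs
}

def transform_for_shadow(headers: dict) -> dict:
    out = {k: v for k, v in headers.items()
           if k.lower() not in SHADOW_HEADER_DROP}
    out.update(SHADOW_HEADER_SET)
    return out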

Reproduce this on your laptop

# Install dependencies (wrk/wrk2 are standalone binaries, not pip packages —
# install them via your package manager or build from source)
python3 -m venv .venv && source .venv/bin/activate
pip install aiohttp

# Start two backends — "prod" returning fast, "shadow" returning slower
python3 -m http.server 8081 &        # prod stub
python3 -c "
import time, http.server
class Slow(http.server.BaseHTTPRequestHandler):
    def do_GET(self): time.sleep(0.05); self.send_response(200); self.end_headers(); self.wfile.write(b'shadow-slow')
http.server.HTTPServer(('', 8082), Slow).serve_forever()" &

# Run the mirror on :8000 forwarding to :8081 prod, mirroring to :8082 shadow
PROD_URL=http://localhost:8081 SHADOW_URL=http://localhost:8082 python3 shadow_mirror.py 8000 &

# Drive traffic and watch the comparison log fill
wrk -t4 -c50 -d20s http://localhost:8000/
tail -f /var/log/shadow-compare.jsonl | jq 'select(.shadow_ms > .prod_ms * 2)'

Edit SHADOW_RATE upward and SHADOW_BUDGET_RPS downward to see the circuit-breaker engage when the shadow can't keep up.

Where this leads next

Shadow traffic sits between chaos-under-load (which validates that the system tolerates faults) and the operational disciplines that turn pre-launch confidence into post-launch stability.

The closing rule: a shadow that does not handle write side-effects safely is a P0 incident waiting to happen; a shadow whose comparison only uses byte-equality drowns its real signal in false positives; a shadow that runs for an afternoon validates the candidate against 0.4% of the weekly traffic distribution. Treat the shadow as production code, design its write-isolation pattern explicitly, build a semantic comparator with noise correction, and run it through the full weekly cycle including the peak window. Do that and the launch is boring; skip any of it and the launch becomes the postmortem.
