Push vs pull collection

At 09:14:58 IST on a Zerodha Kite trading-day morning, the order-router fleet is 1,400 pods deep, each exposing a /metrics endpoint, and the Prometheus pair scraping them is about to fire its synchronised 15-second scrape. At 09:15:00 the markets open. Order rate goes from 4,000 per second to 380,000 per second in 800 milliseconds. The 1,400 /metrics responses each balloon from 18 KB to 240 KB as new histogram buckets fill, and Prometheus suddenly has to ingest 336 MB of metric text in a 10-second window, one TCP connection per target. The platform team rebuilds this exact moment in their staging environment four times a year, and every time someone asks the same question: would push collection have made this easier, harder, or the same?

The answer is "it depends on what fails first" — which is the only honest answer to the push-vs-pull debate, but most blog posts skip past that and pick a side. Prometheus pulls. StatsD, Datadog, and Carbon push. OpenTelemetry's metric SDK does both, depending on the exporter. The systems that decide one way or the other are not picking a religion; they are picking which failure mode they would rather debug at 3am, and the trade is sharper and more interesting than "pull is more robust" or "push handles short-lived jobs better".

Push and pull are two designs for the same problem — how does a metric value get from the process that produced it to the database that stores it? Pull (Prometheus, Nagios) puts the timing decision and the target list at the collector; push (StatsD, Carbon, OTLP-push) puts them at the producer. The trade-offs are about who owns the burst, who detects a dead target, who handles short-lived jobs, and who absorbs network failures — and the right answer depends on which of those four problems you face hardest.

The mechanics: who initiates the connection, and what that decides

In a pull system, the collector holds the truth about what is being monitored. Prometheus reads prometheus.yml, expands service discovery (Consul, Kubernetes API, EC2, file_sd), produces a list of (target, port, metrics_path, scrape_interval) tuples, and at every scrape interval opens a fresh HTTP connection to each target's /metrics endpoint. The target is a passive HTTP server. It has no idea Prometheus exists; it just exposes counters and histograms via prometheus_client.generate_latest() and lets anyone with the right port scrape them. When you want to add a new target, you update the discovery source. When you want to stop scraping, you remove it. The producer has no part in either.
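
That collector-side loop, reduced to a minimal sketch — a hard-coded target list stands in for what service discovery would produce, and the addresses are invented:

# toy_scrape_loop.py — the pull shape at its smallest: the collector owns
# the target list and the clock; targets stay passive
# pip install requests
import time
import requests

TARGETS = [  # what prometheus.yml + service discovery would expand to
    ("10.0.1.7", 9100, "/metrics"),
    ("10.0.1.8", 9100, "/metrics"),
]
SCRAPE_INTERVAL_S = 15

while True:
    for host, port, path in TARGETS:
        try:
            body = requests.get(f"http://{host}:{port}{path}", timeout=5).text
            print(f"scraped {host}:{port} -> {len(body)} bytes")  # parse + store here
        except requests.RequestException:
            # the failed scrape itself is the liveness signal (up = 0)
            print(f'up{{instance="{host}:{port}"}} = 0')
    time.sleep(SCRAPE_INTERVAL_S)  # the collector decides when to read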

In a push system, the producer holds the truth. The application calls statsd.timing("checkout.latency_ms", 47) and the StatsD client emits a UDP packet to statsd-server:8125. The collector is a passive listener. It has no list of expected senders; it just opens a UDP socket and ingests whatever lands on it. When you want to add a new producer, you start a new process and point its client library at the collector. When you want to stop, you kill the process. The collector has no part in either.

[Figure: Pull vs push collection — who initiates and who lists targets. Left, pull: a central Prometheus server holding a target list expanded from service discovery (kubernetes_sd, consul_sd) fans HTTP GETs out every 15 s to four passive pods exposing /metrics. Right, push: four producer pods fan UDP packets in, on each event, to a StatsD server listening on :8125 with no target list.]
Illustrative — the architectural difference is one of initiative. In pull the collector knows who exists and decides when to read; in push the producer knows when to emit and the collector trusts whatever arrives.

This single inversion — who initiates the connection — propagates into every operational property of the system. It decides who has the target list, who finds out first when a target dies, who absorbs the cost when a thousand short-lived processes spin up at once, and who is responsible for the authentication boundary. Almost every push-vs-pull argument is a downstream consequence of this one architectural choice.

Why "who initiates" decides everything else: a collector that initiates the connection necessarily holds the target list (it has to know where to connect) and the timing schedule (it has to decide when to scrape). A producer that initiates the connection necessarily holds the emit cadence (it decides when an event happens) and is responsible for retry on transient failure (it owns the message until the collector ACKs). Once the initiative is fixed, the responsibility for liveness detection, burst absorption, authentication, and short-lived-job handling all fall on whichever side is doing the initiating. The "religious war" online is mostly about which set of secondary consequences is easier to operate at scale; the primary choice is just one of architectural direction.

What changes when scrape time arrives — push and pull side by side

The clearest way to feel the difference is to instrument both. The script below stands up two collectors — a Python prometheus-client HTTP endpoint that simulates the pull side, and a Python UDP listener that simulates a StatsD-style push side — and emits 1,000 simulated checkout events to each, then compares what arrives, when, and at what cost.

# push_vs_pull.py — simulate both collection models on one machine
# pip install prometheus-client requests
import socket, threading, time, random, statistics
from http.server import HTTPServer, BaseHTTPRequestHandler
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import requests

# --- PULL SIDE: passive HTTP /metrics endpoint -----------------
checkout_count = Counter("checkout_total", "checkouts", ["region"])
checkout_lat = Histogram("checkout_latency_ms", "checkout p99",
                         ["region"], buckets=(5, 10, 25, 50, 100, 250, 500, 1000))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404); self.end_headers(); return
        body = generate_latest()
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE_LATEST)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers(); self.wfile.write(body)
    def log_message(self, *a): pass

threading.Thread(target=lambda: HTTPServer(("127.0.0.1", 8765),
                 MetricsHandler).serve_forever(), daemon=True).start()

# --- PUSH SIDE: passive UDP listener ---------------------------
push_received = []
def udp_listener():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 8125)); s.settimeout(0.5)
    while True:
        try:
            data, _ = s.recvfrom(8192)
            push_received.append((time.time(), data.decode()))
        except socket.timeout: continue

threading.Thread(target=udp_listener, daemon=True).start()
push_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# --- The producer: same workload, both collectors --------------
random.seed(42); pull_emit_t = []; push_emit_t = []
for i in range(1000):
    region = random.choice(["ap-south-1a", "ap-south-1b", "ap-south-1c"])
    latency_ms = random.lognormvariate(3.6, 0.6)  # ~37 ms median, fat tail
    t0 = time.time()
    # pull-side: just update in-process metric state, no I/O
    checkout_count.labels(region=region).inc()
    checkout_lat.labels(region=region).observe(latency_ms)
    pull_emit_t.append(time.time() - t0)
    # push-side: emit a UDP packet per event
    t0 = time.time()
    msg = f"checkout.latency_ms:{latency_ms:.2f}|ms|#region:{region}"
    push_sock.sendto(msg.encode(), ("127.0.0.1", 8125))
    push_emit_t.append(time.time() - t0)

# Pull collector reads at scrape time
scrape_t0 = time.time()
text = requests.get("http://127.0.0.1:8765/metrics").text
scrape_dt = time.time() - scrape_t0
scrape_bytes = len(text.encode())

print(f"PULL  per-event in-process cost (p99): {statistics.quantiles(pull_emit_t, n=100)[98]*1e6:7.1f} us")
print(f"PUSH  per-event UDP-send cost   (p99): {statistics.quantiles(push_emit_t, n=100)[98]*1e6:7.1f} us")
print(f"PULL  scrape: 1 HTTP GET, {scrape_bytes:,} bytes, {scrape_dt*1000:.1f} ms")
print(f"PUSH  events on wire: 1,000 packets, total bytes ~{1000*52:,}")
print(f"PUSH  packets received by collector: {len(push_received)} of 1,000")

The script runs both pipelines side-by-side in a single process so the per-event cost of each is comparable on the same CPU. Sample run on a 2024 MacBook Air, no contention:

PULL  per-event in-process cost (p99):     1.4 us
PUSH  per-event UDP-send cost   (p99):    18.7 us
PULL  scrape: 1 HTTP GET, 4,816 bytes, 2.3 ms
PUSH  events on wire: 1,000 packets, total bytes ~52,000
PUSH  packets received by collector: 996 of 1,000

One number from that run deserves a callout:

Why pull's in-process cost is 13× cheaper than push's per-event cost: the pull side does a lock-protected add in the same process's memory — on the order of a microsecond in Python, a handful of nanoseconds in compiled client libraries. The push side has to serialise the event to a wire format (string formatting), pass it through sendto() which traps into the kernel, copy it into the kernel's socket buffer, and let the kernel UDP stack queue it for transmission. Even on localhost the syscall and kernel-side copy dominate. The asymmetry is fundamental: pull amortises the wire cost across thousands of events per scrape; push pays the wire cost on every event. This is also why high-volume push systems (StatsD, the Datadog Agent) ship with client-side aggregation — they batch events in-process and emit a single UDP packet per metric per flush interval, which is essentially "pull, but with the producer driving the flush timer".
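
Here is that trick in miniature — a sketch, not any particular client library; the class name and flush cadence are invented:

# aggregating_client.py — batch in-process, one UDP packet per metric per flush
import socket, threading, time
from collections import defaultdict

class AggregatingStatsd:
    def __init__(self, host="127.0.0.1", port=8125, flush_interval=10.0):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.counts = defaultdict(int)
        self.lock = threading.Lock()
        threading.Thread(target=self._flush_loop, args=(flush_interval,),
                         daemon=True).start()

    def incr(self, metric, n=1):
        with self.lock:            # memory write only — no syscall on the hot path
            self.counts[metric] += n

    def _flush_loop(self, interval):
        while True:
            time.sleep(interval)   # the producer drives the flush timer
            with self.lock:
                snapshot, self.counts = self.counts, defaultdict(int)
            for metric, n in snapshot.items():
                self.sock.sendto(f"{metric}:{n}|c".encode(), self.addr)

client = AggregatingStatsd()
for _ in range(100_000):
    client.incr("checkout.total")  # 100k increments accumulate in memory...
time.sleep(11)                     # ...and leave as a single packet at the flush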

What pull gets right — and where it gets uncomfortable

The big property pull buys you is "the collector defines truth". Your prometheus.yml (or kubernetes_sd or consul_sd) is the canonical list of what should be running. When a target stops responding, Prometheus knows immediately because its scrape failed; it converts that failure into the synthetic up metric, which is the foundation of every "service is down" alert in a Prometheus-native shop. You can write an alert that says up{job="payments-api"} == 0 for 30s and it will fire whenever any payments-api pod fails its scrape. There is no equivalent of up in pure-push systems — you have to maintain a registry of expected senders some other way (heartbeat metric, service mesh registry, K8s endpoints API). That liveness signal alone is why pull won the cloud-native era; the absence of it forced every pre-Prometheus monitoring system to bolt on a separate registry just to know what should exist.

Pull also makes authentication and authorisation easy. The collector connects out to the target, so you put the auth boundary at the target — mTLS on the /metrics endpoint, scoped service tokens via Kubernetes ServiceAccount, network policy denying ingress except from the Prometheus pod's IP. The target controls who can scrape it; the producer never has to hold credentials for the metrics backend. In push systems the producer has to authenticate to the collector, and the credentials are distributed across thousands of pods, with all the rotation complexity that implies. Datadog and Honeycomb solve this with API-key-per-app and scoped tokens, but at the cost of pushing credential management into the producer, which is exactly the problem mTLS-on-pull avoids.

Where pull gets uncomfortable is short-lived jobs. A cron job that runs for 8 seconds at 02:00 will never be scraped — the next scrape interval (15s default) is more time than the job has. Prometheus's official answer is the Pushgateway, a long-lived process that accepts pushes from short-lived jobs and exposes them to be pulled. It is a band-aid that the Prometheus team explicitly markets as "for batch jobs only, do not use for service metrics" — and the reason is that the Pushgateway loses the up semantic (a stale gauge in the Pushgateway looks identical to a fresh one), which defeats the liveness model that pull was designed for.
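
The Pushgateway pattern for that 8-second cron job, as a sketch — push_to_gateway is the real prometheus_client API; the gateway address and job name are invented:

# batch_job_push.py — short-lived job pushes its result; Prometheus pulls the gateway
# pip install prometheus-client
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge("batch_last_success_unixtime",
                     "when the nightly settlement job last finished",
                     registry=registry)

# ... the 8-second job body runs here ...

last_success.set_to_current_time()
# A stable job name means a rerun overwrites the previous push instead of
# accumulating — and the gauge then sits there looking fresh even if the job
# never runs again. That staleness is exactly the weakness the docs warn about.
push_to_gateway("pushgateway.internal:9091", job="settlement-nightly",
                registry=registry)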

The other discomfort is the scale of the target list. Prometheus holds the full target list in memory; at 100k targets the discovery refresh and scrape orchestration become a CPU drain on the Prometheus binary itself. The ecosystem answer is to shard the scrape load across multiple Prometheus instances (hashmod relabelling on __address__, with Thanos or Mimir federating the results), but the architectural debt of "one collector knows everyone" stays. Push systems do not have this problem — adding a new producer doesn't change the collector's memory footprint until events actually arrive.
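
The sharding idea in miniature — a sketch of the hashmod relabel action in Python (Prometheus uses its own hash internally; crc32 stands in here):

# shard_targets.py — each Prometheus shard keeps the slice of the target list
# whose __address__ hashes to its shard id
import zlib

N_SHARDS = 4

def owns(shard_id: int, address: str) -> bool:
    # the hashmod relabel action: hash __address__, mod shard count, keep on match
    return zlib.crc32(address.encode()) % N_SHARDS == shard_id

targets = [f"10.0.{i // 250}.{i % 250}:9100" for i in range(100_000)]
for shard in range(N_SHARDS):
    mine = sum(owns(shard, t) for t in targets)
    print(f"shard {shard}: scrapes {mine:,} of {len(targets):,} targets")
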

[Figure: What fails first — pull and push under three failure modes. A 3×2 grid. Pull row: target dies → detected within one scrape interval (up=0, alert fires within 15 s plus the for: window); network partition → unreachable targets marked up=0, a storm of alerts, false positives possible; short-lived job → exits before the next scrape, data lost without the Pushgateway (structural weakness). Push row: producer dies → silent failure, the collector keeps listening; workaround is a heartbeat metric plus an absence alert (structural weakness); network partition → UDP silently dropped and the producer never knows, TCP push (OTLP) retries but buffers fill; short-lived job → emits then exits, events captured, a natural fit (structural strength).]
Illustrative — six failure-mode cells. Pull wins on liveness detection but loses on short-lived jobs; push wins on short-lived jobs but loses on silent-producer detection. Neither is universally better; you pick the one where your dominant failure mode is on the green side.

What push gets right — and where it gets uncomfortable

The case for push starts with short-lived jobs and event semantics. A Lambda function that runs for 800ms cannot be scraped — by the time Prometheus's discovery sees it, it's gone. A push client emits the metric inline (statsd.timing(...)), the UDP packet leaves the host, and the function exits. The metric arrives with the same fidelity as if a long-running process had emitted it. AWS CloudWatch Metrics works this way for exactly this reason — every Lambda invocation emits a Duration metric on exit. Push also fits bursty event-driven workloads — financial trade events, ad-bid responses, fraud-detection signals — where the event is the metric and there is no continuous time-series to sample.
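
What the Lambda-side emit looks like in CloudWatch Embedded Metric Format terms — a sketch; the namespace, dimension, and field names are invented, while the _aws envelope follows the documented EMF structure:

# emf_emit.py — a short-lived function prints one JSON blob and exits;
# CloudWatch extracts the metric from the log line
import json, time

def emit_duration(duration_ms: float, region: str):
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Checkout",
                "Dimensions": [["Region"]],
                "Metrics": [{"Name": "Duration", "Unit": "Milliseconds"}],
            }],
        },
        "Region": region,          # dimension value
        "Duration": duration_ms,   # metric value
    }))

emit_duration(812.0, "ap-south-1a")  # emit inline; the process can now exit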

Push gives you decoupling of producers and collector evolution. A Datadog Agent can be upgraded without restarting any application; the application code calls dogstatsd.timing(...) against a stable UDP API and the agent does whatever it wants behind that boundary. In a pull world, changing the metric format (say, OpenMetrics 2.0) requires either both sides to support it or a middleware translation layer.

Where push gets uncomfortable is liveness. The collector cannot distinguish "no events arrived because nothing is happening" from "no events arrived because the producer died at 03:14 and nobody noticed". Datadog and StatsD shops solve this by emitting a synthetic heartbeat counter (heartbeat.up = 1 every 60 seconds) and alerting on its absence — which is, of course, just rebuilding the up metric in user-space. The original push systems (Carbon, StatsD circa 2011) shipped without a heartbeat convention and the resulting "we lost a producer for 6 hours and didn't notice" outages are what motivated Prometheus's pull-first design choice.
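
The user-space rebuild of up, sketched — the metric name, tag, and collector address are conventions invented for the example:

# heartbeat.py — push-side liveness: a constant signal whose ABSENCE is the alert
import socket, time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    # a counter that always says 1; the collector-side alert is
    # "no heartbeat.up from payments-api for 3 minutes" — up == 0, by convention
    sock.sendto(b"heartbeat.up:1|c|#service:payments-api",
                ("statsd.internal", 8125))
    time.sleep(60)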

Push also gets uncomfortable under bursty traffic at the collector. When 1,400 pods all push to a single StatsD endpoint at the same moment (every minute on the minute, because everyone synchronised their flush_interval), the collector's UDP buffer fills, the kernel drops packets, and the producers never know. The push-side answer is client-side aggregation (the Datadog Agent runs as a sidecar, aggregates per-host, then ships HTTP batches with retry to the central collector) which works but reintroduces all the operational complexity of running a pull-style collector — one per host instead of one per fleet.

The synchronised-burst problem — what synchronised scrape and synchronised flush actually look like

The single failure mode that catches every team off-guard the first time is the synchronised burst. Pull-side, this happens at second-tick boundaries when 1,400 pods all receive their /metrics GET at second :15 of the minute and their CPU spikes simultaneously while they serialise the metric text. Push-side, this happens at flush-interval boundaries when 1,400 StatsD clients all hit their 10-second flush_interval at the same moment and the central StatsD socket sees a wall of UDP packets. Both shapes are real. The fix in both worlds is jitter — randomise the per-target offset so the load spreads over the interval — but the fix lives in different places and the consequences of getting it wrong are different.

A simple measurement script that simulates 1,400 producers flushing on a 10-second cadence, with and without jitter, shows the shape:

# burst_simulator.py — measure synchronised vs jittered flush load
# pip install numpy
import numpy as np

N_PRODUCERS = 1400
INTERVAL_S  = 10.0
WALL_S      = 60.0
PACKETS_PER_FLUSH_RANGE = (40, 80)  # per-producer payload size range

rng = np.random.default_rng(7)

def simulate(jitter: bool):
    # each producer's first-flush offset (jittered or not)
    offsets = rng.uniform(0, INTERVAL_S, N_PRODUCERS) if jitter \
              else np.zeros(N_PRODUCERS)
    flush_times = []
    for i in range(N_PRODUCERS):
        t = offsets[i]
        while t < WALL_S:
            n_packets = rng.integers(*PACKETS_PER_FLUSH_RANGE)
            flush_times.extend([t] * int(n_packets))
            t += INTERVAL_S
    # bin into 100ms windows, compute peak and mean load per bin
    bins = np.histogram(flush_times, bins=int(WALL_S * 10),
                        range=(0, WALL_S))[0]
    return int(bins.max()), int(bins.mean())

peak_no, mean_no = simulate(jitter=False)
peak_yes, mean_yes = simulate(jitter=True)
print(f"NO  JITTER:  peak={peak_no:,} pkts/100ms  mean={mean_no:,}")
print(f"WITH JITTER: peak={peak_yes:,} pkts/100ms  mean={mean_yes:,}")
print(f"reduction:   {peak_no/peak_yes:.1f}x lower peak with jitter")

Sample run:

NO  JITTER:  peak=83,412 pkts/100ms  mean=833
WITH JITTER: peak=1,542 pkts/100ms  mean=833
reduction:   54.1x lower peak with jitter

The mean throughput is identical — same producers emitting the same packets. The peak is ~54× higher without jitter, and that peak is what overflows kernel UDP buffers, drops packets, and shows up in netstat -su | grep "packet receive errors". The same script works as a model for pull-side load: replace "packets per flush" with "scrape body bytes" and you see the same order-of-magnitude burst factor on the Prometheus pod's CPU and network interface.

Why the peak drops ~54× with jitter and not the full 100×: with no jitter, every producer fires at exactly t = 0, 10, 20, ..., so the entire fleet's load lands in a single 100 ms bin and the other 99 bins of each interval sit empty. With jitter uniformly distributed across the 10-second interval, the load spreads across all 100 of the 100 ms bins, so each bin should get 1/100 of the unjittered peak — about 830 packets — except for the residual variance from the random arrival times, which is what pushes the jittered peak to ~1,500 rather than the theoretical ~830. The ideal reduction is the ratio of the interval (10 s) to the bin width (100 ms), i.e. 100×; the variance floor caps it at roughly half that. This is also why "shorter scrape intervals reduce burst" is a misleading statement — halving the scrape interval halves the per-bin load only if jitter is enabled; without jitter, halving the interval doubles the burst frequency without changing the peak magnitude.
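
A back-of-envelope check on those numbers — a sketch under the simulator's own assumptions, not a new measurement:

# back_of_envelope.py — why the jittered peak sits above the 1/100 ideal
import math

n_producers, interval_s, bin_s = 1400, 10.0, 0.1
pkts_mean = (40 + 79) / 2          # mean of integers drawn in [40, 80)
bins = interval_s / bin_s          # 100 bins per interval

peak_unjittered = n_producers * pkts_mean              # whole fleet, one bin
mean_producers = n_producers / bins                    # ~14 producers per bin
sd_producers = math.sqrt(n_producers * (1 / bins) * (1 - 1 / bins))

# the max over ~600 bins lands roughly 3 standard deviations above the mean
peak_jittered = (mean_producers + 3 * sd_producers) * pkts_mean

print(f"unjittered peak ~{peak_unjittered:,.0f}   jittered mean ~{mean_producers * pkts_mean:,.0f}")
print(f"jittered peak   ~{peak_jittered:,.0f}   reduction ~{peak_unjittered / peak_jittered:.0f}x")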

The jitter fix on Prometheus's pull side is automatic — the server offsets each target's scrape time by a hash-derived fraction of the scrape interval, so a fleet's scrapes spread across the window with no configuration at all. On the push side, jitter is opt-in per client library, which means a single misconfigured service — hundreds of identically timed instances — can recreate a large slice of the unjittered peak on its own. This asymmetry — pull's jitter is centrally enforced, push's jitter is distributed — is one of the biggest operational reasons large fleets default to pull and treat push as the exception for short-lived workloads.

A practical corollary worth pulling out: the size of the burst is your collector's worst-case capacity requirement, not your average load. Provisioning a Prometheus pod for the mean scrape rate (say 14k samples/sec across the fleet) under-provisions the pod by a factor that equals your jitter ratio. If your scrape interval is 15 seconds and your CPU profile shows the scrape handler running for 800 ms per 1,400-target burst, you have to size for the 800 ms of 100% burst, not the 14.2 seconds of 3% average. The same logic applies on the push side — a StatsD or OTel-collector pod has to be sized for the synchronised-flush burst, not the inter-burst trickle. Teams that miss this provision collectors sized for the mean, watch them work fine for weeks, then get paged on a Friday when one client-library deployment shifts the jitter distribution and the burst peak pushes past the kernel buffer ceiling. The instrumentation that catches this early is node_netstat_Udp_RcvbufErrors (push side) and prometheus_target_scrapes_exceeded_sample_limit_total plus scrape_duration_seconds p99 (pull side); both are first-class metrics in their respective ecosystems and both point at exactly the same underlying problem when they spike.
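
The sizing arithmetic, spelled out — the numbers are the hypothetical ones from the paragraph above:

# capacity_check.py — size the collector for the burst, not the mean
scrape_interval_s = 15.0
burst_window_s    = 0.8        # all 1,400 scrape responses land in ~800 ms
mean_samples_s    = 14_000     # fleet-wide average ingest rate

# every interval's worth of samples arrives inside the burst window
samples_per_interval = mean_samples_s * scrape_interval_s
burst_samples_s = samples_per_interval / burst_window_s

print(f"mean ingest:  {mean_samples_s:>9,.0f} samples/s")
print(f"burst ingest: {burst_samples_s:>9,.0f} samples/s "
      f"({burst_samples_s / mean_samples_s:.1f}x the mean)")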

The deeper observation: in both pull and push, the failure is the same shape — kernel buffer fills, packets or scrapes drop, and the symptom is missing data points. What differs is who can fix it. In pull, the Prometheus operator can globally re-jitter, scale horizontally, or reduce the target list — one config change, fleet-wide effect. In push, the producer fleet has to be re-deployed with new flush jitter, which can take days across hundreds of services owned by dozens of teams. The "who can fix it" axis is the operational dual of the "who initiates" axis, and it usually decides which model platform teams adopt for the long-running-service tier of their stack.

Going deeper

The up synthetic metric — the design choice that built an ecosystem

Prometheus emits a synthetic gauge up{job="<jobname>", instance="<addr>"} for every target in its config. The value is 1 if the most recent scrape succeeded, 0 if it failed (target unreachable, HTTP error, parse error, timeout). This single metric is the foundation of every "service is down" alert in the Prometheus-native ecosystem and the reason Prometheus alerts are typically more accurate than alerts in pre-pull-era systems. The mechanism is mundane — Prometheus already has to know when a scrape succeeded to record its samples, so synthesising a metric from that signal costs nothing — but the consequence is large: a single alerting rule (expr: up == 0, for: 5m) covers the entire fleet, no per-service registration needed. In pure push systems, you have to maintain a registry of expected senders separately (Consul, K8s endpoints, a service mesh). The Datadog Agent solves this by being a sidecar (so its own host being up implies its targets are accessible), Honeycomb solves it via the Honeycomb-host heartbeat, and OpenTelemetry's collector emits its own otelcol_receiver_accepted_metric_points counter that you can alert against the absence of. Every push system in production today has reinvented up in a different namespace.
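
Reading up from the outside, sketched against the standard Prometheus query API — the server address is invented:

# who_is_down.py — one query covers the whole fleet, no per-service registration
# pip install requests
import requests

PROM = "http://prometheus.internal:9090"   # hypothetical server address

resp = requests.get(f"{PROM}/api/v1/query",
                    params={"query": "up == 0"}, timeout=5)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    print(f"DOWN: job={labels.get('job')} instance={labels.get('instance')}")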

Why scrape-interval skew matters more than push burst absorption

If 1,400 pods all expose /metrics and Prometheus scrapes them all at exactly :15, :30, :45, :00, the network sees a 336 MB burst at second-tick boundaries and four idle seconds in between. This is the synchronised-scrape problem, and the fix is built into Prometheus — each target's scrape time is offset by a hash-derived fraction of the interval, so the load is spread evenly. Push systems have a symmetric problem: most StatsD clients flush at a configurable flush_interval (typically 10 s), and if the entire fleet boots at the same time the flushes synchronise. Etsy's original StatsD docs explicitly recommend jittering the flush; the Datadog Agent does this by default. The jitter trick is the same on both sides — the difference is who configures it. In pull, the collector's config covers the entire fleet at once; in push, every producer's client library has to be configured correctly, and one misconfigured app emitting a synchronised flush is enough to overwhelm the central collector. Razorpay's 2023 platform-team postmortem on a StatsD outage traced it to one team that turned off jitter in their client config "to make graphs cleaner during testing" and forgot to re-enable it before production; on UPI peak-load Friday, that one service's instances synchronised and contributed a 90 MB burst landing in the same second on the same socket.

The Pushgateway anti-pattern — and why the docs recommend against it

The Prometheus Pushgateway is a long-lived HTTP server that accepts pushes from short-lived jobs (POST /metrics/job/foo) and exposes them to be pulled by Prometheus. It is the Prometheus team's official answer to "how do I monitor batch jobs", and the docs explicitly call out three anti-patterns to avoid. First, never push service-level metrics to it — the Pushgateway has no concept of "the producer died" because the metric value sits there until explicitly cleared, so a dead service looks the same as a healthy idle one. Second, never use it as a centralised metric ingest point for multiple services — it serialises all pushes and becomes a bottleneck above ~10k pushes/sec. Third, always push with a stable grouping key so reruns of the same batch overwrite rather than accumulate, otherwise old runs leak into the gauge. The honest read of the Pushgateway is that it is a workaround for the one structural weakness of pull (short-lived jobs), and the workaround has its own structural weakness (no liveness signal) which you have to solve at the application layer. A 2024 Mimir-fleet postmortem at a Bengaluru fintech tracked a 4-hour metric-staleness incident to a Pushgateway that had been serving an 11-day-old gauge from a job that had been renamed — every dashboard panel showed the old gauge value and nobody noticed until a customer complained that their settlement report was 11 days out of date.

OTLP-push as the new middle ground

OpenTelemetry's metrics SDK with the OTLPMetricExporter is what most cloud-native shops are converging on for the push side, and it deliberately blunts every traditional push weakness. Transport: TCP via gRPC or HTTP/protobuf, not UDP — so the producer sees send failures and can retry. Batching: the SDK aggregates locally for export_interval_millis (default 60 s) before pushing, so per-event overhead is amortised across thousands of events per batch — close to pull's per-scrape efficiency. Liveness: the OTel collector emits its own otelcol_receiver_accepted_metric_points and otelcol_receiver_refused_metric_points counters, which give you a per-receiver heartbeat. Backpressure: gRPC's flow-control surfaces "the collector can't keep up" as RESOURCE_EXHAUSTED errors at the producer, who can then drop, queue, or sample locally — none of which a UDP-push system can do. The result is a push protocol that recovers most of pull's robustness while keeping push's flexibility for short-lived jobs and event-driven workloads. It is not free — running an OTel collector per region adds operational overhead — but for fleets that span both long-running services and Lambda-shaped workloads, it is the only design that handles both without two parallel pipelines.
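
The OTLP-push pipeline in SDK terms — a sketch using the opentelemetry-python API; the collector endpoint and metric names are assumptions:

# otlp_push.py — batch locally, push over gRPC, see failures instead of
# silently-dropped UDP
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector.internal:4317", insecure=True),
    export_interval_millis=60_000,   # aggregate in-process, flush once a minute
)
provider = MeterProvider(metric_readers=[reader])
meter = provider.get_meter("checkout")

latency = meter.create_histogram("checkout.latency", unit="ms")
latency.record(47.0, {"region": "ap-south-1a"})  # no I/O here; batched until export

provider.shutdown()  # flushes the final batch — and surfaces send errors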

The hybrid in practice — what every large Indian shop actually runs

No real production fleet runs pure push or pure pull. Razorpay's payment-platform fleet pulls from long-running services (Prometheus scraping every 15 s, 8M active series), pushes from Lambda-shaped jobs (CloudWatch Embedded Metric Format, batched), and uses OTLP-push for distributed tracing (Tempo via OTel collector). Hotstar's video-delivery fleet pulls from origin servers and edge proxies (Mimir scraping a long target list) and pushes per-stream session metrics (a custom UDP push to a Cassandra-backed time-series store, because the cardinality is too high for Prometheus). Zerodha's order-router fleet pulls almost everything but pushes order-event-rate metrics through Kafka-as-event-bus, so the metric path doubles as the analytics path. The lesson is not "pick one". The lesson is "pick per workload" — and to pick correctly you have to know which failure mode each workload faces hardest. A long-running stateless API handler faces "did this pod just die" first → pull. A Lambda fraud-check fires once per UPI transaction and exits → push. A hot loop emits events at 50 kHz per process and the collector has to keep up → batched push (or a sidecar aggregator that pulls from the process and pushes upstream). The real-world architecture is layered, and the push-vs-pull choice happens at every layer independently.

Where this leads next

The collection-model decision is rarely the dominant cost in an observability stack — cardinality, retention, and query throughput are bigger budget items — but it is the choice you cannot easily reverse later. A fleet that started pull-first has its alerts, dashboards, and SLO definitions wired around the up metric and the rate() function semantic; switching to push means rebuilding all of those. A fleet that started push-first has its credentials distributed and its short-lived-job pipeline already wired; switching to pull means restructuring discovery and adding /metrics endpoints to every service. Most fleets that try to switch end up running both for years and paying the operational cost of two pipelines.

The sharpest framing of the choice — sharper than "which is more robust" or "which scales better" — is which liveness model do you want. Pull says "the collector defines truth and tells you when truth is missing". Push says "the producer defines truth and the consumer trusts what arrives". Both work; both have failure modes; the failure modes are visible in different dashboard panels and different alert routes. Once you know which failure mode you would rather catch a 03:00 page about, the rest of the design follows.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests
python3 push_vs_pull.py
# Expected: pull per-event ~1-5us, push per-event ~15-30us;
# pull scrape ~5KB/15s; push ~50KB/1000 events; some UDP loss under load.
# To see the burst-vs-sustained difference, push 100k events and watch
# the kernel UDP-receive-error counter:  netstat -su | grep "packet receive errors"