Push vs pull collection
At 09:14:58 IST on a Zerodha Kite trading-day morning, the order-router fleet is 1,400 pods deep, each pod is exposing a /metrics endpoint, and the Prometheus pair scraping them is about to fire its synchronised 15-second scrape. At 09:15:00 the markets open. Order rate goes from 4,000 per second to 380,000 per second in 800 milliseconds. The 1,400 /metrics responses each balloon from 18 KB to 240 KB as new histogram buckets fill, and Prometheus suddenly has to ingest 336 MB of metric text in a 10-second window through a single TCP-connection-per-target. The platform team rebuilds this exact moment in their staging environment four times a year, and every time someone asks the same question: would push collection have made this easier, harder, or the same?
The answer is "it depends on what fails first" — which is the only honest answer to the push-vs-pull debate, but most blog posts skip past that and pick a side. Prometheus pulls. StatsD, Datadog, and Carbon push. OpenTelemetry's metric SDK does both, depending on the exporter. The systems that decide one way or the other are not picking a religion; they are picking which failure mode they would rather debug at 3am, and the trade is sharper and more interesting than "pull is more robust" or "push handles short-lived jobs better".
Push and pull are two designs for the same problem — how does a metric value get from the process that produced it to the database that stores it? Pull (Prometheus, Nagios) puts the timing decision and the target list at the collector; push (StatsD, Carbon, OTLP-push) puts them at the producer. The trade-offs are about who owns the burst, who detects a dead target, who handles short-lived jobs, and who absorbs network failures — and the right answer depends on which of those four problems you face hardest.
The mechanics: who initiates the connection, and what that decides
In a pull system, the collector holds the truth about what is being monitored. Prometheus reads prometheus.yml, expands service-discovery (Consul, Kubernetes API, EC2, file_sd), produces a list of (target, port, metrics_path, scrape_interval) tuples, and at every scrape interval opens a fresh HTTP connection to each target's /metrics endpoint. The target is a passive HTTP server. It has no idea Prometheus exists; it just exposes counters and histograms via prometheus_client.generate_latest() and lets anyone with the right port scrape them. When you want to add a new target, you update the discovery source. When you want to stop scraping, you remove it. The producer has no part in either.
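The target side of that pull loop is deliberately boring. The comparison script further down hand-rolls the HTTP endpoint so it can instrument it; in a real service the passive target is usually a one-liner. A minimal sketch, assuming the prometheus_client package (port and metric names are illustrative):
# pull_target.py: a passive pull target; the process knows nothing about Prometheus
# pip install prometheus-client
import random, time
from prometheus_client import Counter, start_http_server

orders = Counter("orders_processed_total", "Orders processed", ["region"])

if __name__ == "__main__":
    start_http_server(9100)   # exposes /metrics on :9100; that is the entire integration
    while True:               # the real work loop; metric updates are in-process side effects
        orders.labels(region="ap-south-1a").inc()
        time.sleep(random.uniform(0.01, 0.1))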
In a push system, the producer holds the truth. The application calls statsd.timing("checkout.latency_ms", 47) and the StatsD client emits a UDP packet to statsd-server:8125. The collector is a passive listener. It has no list of expected senders; it just opens a UDP socket and ingests whatever lands on it. When you want to add a new producer, you start a new process and point its client library at the collector. When you want to stop, you kill the process. The collector has no part in either.
This single inversion — who initiates the connection — propagates into every operational property of the system. It decides who has the target list, who finds out first when a target dies, who absorbs the cost when a thousand short-lived processes spin up at once, and who is responsible for the authentication boundary. Almost every push-vs-pull argument is a downstream consequence of this one architectural choice.
Why "who initiates" decides everything else: a collector that initiates the connection necessarily holds the target list (it has to know where to connect) and the timing schedule (it has to decide when to scrape). A producer that initiates the connection necessarily holds the emit cadence (it decides when an event happens) and is responsible for retry on transient failure (it owns the message until the collector ACKs). Once the initiative is fixed, the responsibility for liveness detection, burst absorption, authentication, and short-lived-job handling all fall on whichever side is doing the initiating. The "religious war" online is mostly about which set of secondary consequences is easier to operate at scale; the primary choice is just one of architectural direction.
What changes when scrape time arrives — push and pull side by side
The clearest way to feel the difference is to instrument both. The script below stands up two collectors — a Python prometheus-client HTTP endpoint that simulates the pull side, and a Python UDP listener that simulates a StatsD-style push side — and emits 1,000 simulated checkout events to each, then compares what arrives, when, and at what cost.
# push_vs_pull.py — simulate both collection models on one machine
# pip install prometheus-client requests
import socket, threading, time, random, statistics
from http.server import HTTPServer, BaseHTTPRequestHandler
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import requests

# --- PULL SIDE: passive HTTP /metrics endpoint -----------------
checkout_count = Counter("checkout_total", "checkouts", ["region"])
checkout_lat = Histogram("checkout_latency_ms", "checkout p99",
                         ["region"], buckets=(5, 10, 25, 50, 100, 250, 500, 1000))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404); self.end_headers(); return
        body = generate_latest()
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE_LATEST)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers(); self.wfile.write(body)
    def log_message(self, *a): pass  # silence per-request logging

threading.Thread(target=lambda: HTTPServer(("127.0.0.1", 8765),
                 MetricsHandler).serve_forever(), daemon=True).start()

# --- PUSH SIDE: passive UDP listener ---------------------------
push_received = []
def udp_listener():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 8125)); s.settimeout(0.5)
    while True:
        try:
            data, _ = s.recvfrom(8192)
            push_received.append((time.time(), data.decode()))
        except socket.timeout:
            continue
threading.Thread(target=udp_listener, daemon=True).start()
push_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# --- The producer: same workload, both collectors --------------
random.seed(42); pull_emit_t = []; push_emit_t = []
for i in range(1000):
    region = random.choice(["ap-south-1a", "ap-south-1b", "ap-south-1c"])
    latency_ms = random.lognormvariate(3.6, 0.6)  # ~37 ms median, fat tail
    # pull-side: just update in-process metric state, no I/O
    t0 = time.perf_counter()
    checkout_count.labels(region=region).inc()
    checkout_lat.labels(region=region).observe(latency_ms)
    pull_emit_t.append(time.perf_counter() - t0)
    # push-side: emit a UDP packet per event
    t0 = time.perf_counter()
    msg = f"checkout.latency_ms:{latency_ms:.2f}|ms|#region:{region}"
    push_sock.sendto(msg.encode(), ("127.0.0.1", 8125))
    push_emit_t.append(time.perf_counter() - t0)

# Pull collector reads at scrape time
scrape_t0 = time.perf_counter()
text = requests.get("http://127.0.0.1:8765/metrics").text
scrape_dt = time.perf_counter() - scrape_t0
scrape_bytes = len(text.encode())
time.sleep(1.0)  # let the UDP listener drain its socket before counting

print(f"PULL per-event in-process cost (p99): {statistics.quantiles(pull_emit_t, n=100)[98]*1e6:7.1f} us")
print(f"PUSH per-event UDP-send cost (p99):   {statistics.quantiles(push_emit_t, n=100)[98]*1e6:7.1f} us")
print(f"PULL scrape: 1 HTTP GET, {scrape_bytes:,} bytes, {scrape_dt*1000:.1f} ms")
print(f"PUSH events on wire: 1,000 packets, total bytes ~{1000*52:,}")
print(f"PUSH packets received by collector: {len(push_received)} of 1,000")
The script runs both pipelines side-by-side in a single process so the per-event cost of each is comparable on the same CPU. Sample run on a 2024 MacBook Air, no contention:
PULL per-event in-process cost (p99): 1.4 us
PUSH per-event UDP-send cost (p99): 18.7 us
PULL scrape: 1 HTTP GET, 4,816 bytes, 2.3 ms
PUSH events on wire: 1,000 packets, total bytes ~52,000
PUSH packets received by collector: 996 of 1,000
Five things deserve a callout from that run:
- The pull-side per-event cost is 1.4 µs — incrementing a Counter and observing into a Histogram is just an atomic add and an array index. The push-side per-event cost is 18.7 µs — 13× higher, because every event allocates and pushes a UDP packet through the kernel's network stack. At Razorpay-scale volumes (200k events/sec), that 17 µs delta is the difference between roughly 0.3 and 3.7 CPU-cores spent on nothing but metric emission.
- The pull side puts 4.8 KB on the wire per scrape — the entire metric state, compactly serialised in OpenMetrics text format, transferred once every 15 seconds. The push side put 52 KB on the wire across 1,000 events during the same window — roughly 10× more bytes, fragmented across 1,000 packets, each carrying 28 bytes of IP+UDP header overhead on top of a ~50-byte payload.
- 996 of 1,000 packets arrived. The push side dropped 4 events: UDP gives no delivery guarantee, the kernel socket buffer was momentarily full, and those four checkout.latency_ms observations are gone. The pull side lost zero — it doesn't lose data on the wire, because the data lives in process memory until scrape time, and the only way to lose it is for the process itself to die before the next scrape.
- The pull collector finds out about the data through one HTTP GET. If that GET fails (target down, network blip, certificate expired), Prometheus records the synthetic metric up{instance="..."} 0, and your alert routing knows the target is down within one scrape interval. The push collector has no such signal — if a producer goes silent, the collector cannot distinguish "the producer is dead" from "the producer had nothing to report this minute". You have to derive liveness some other way.
- The pull-side Counter value is monotonic and can be diffed across scrapes to compute a rate. rate(checkout_total[5m]) works because every scrape sees the cumulative count and PromQL diffs adjacent samples. The push side delivers events — the collector has to bin them into time buckets itself and report deltas, which is fine for counts but loses the "a counter has reset" signal that Prometheus uses to detect process restarts.
Why pull's in-process cost is 13× cheaper than push's per-event cost: the pull side does a single atomic increment in the same process's memory — the operation is roughly 8 nanoseconds of CPU. The push side has to serialise the event to a wire format (string formatting), pass it through sendto() which traps into the kernel, copy it into the kernel's socket buffer, and let the kernel UDP-stack queue it for transmission. Even on localhost the syscall and kernel-side copy dominate. The asymmetry is fundamental: pull amortises the wire cost across thousands of events per scrape; push pays the wire cost on every event. This is also why high-volume push systems (StatsD, Datadog Agent) ship with client-side aggregation — they batch events in-process and emit a single UDP packet per metric per flush interval, which is essentially "pull, but with the producer driving the flush timer".
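Here is a minimal sketch of that client-side aggregation. It is not the Datadog Agent's implementation, just the shape of the technique; the StatsD counter wire format is assumed and the flush cadence is illustrative:
# aggregating_client.py: push with client-side aggregation (sketch; StatsD counter
# wire format assumed, flush cadence illustrative, not any particular agent's code)
import socket, threading, time
from collections import defaultdict

class AggregatingStatsd:
    def __init__(self, host="127.0.0.1", port=8125, flush_interval=10.0):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.counters = defaultdict(float)
        self.lock = threading.Lock()
        threading.Thread(target=self._flush_loop, args=(flush_interval,),
                         daemon=True).start()

    def incr(self, name, value=1.0):
        # in-process accumulation: a dict update, no syscall per event
        with self.lock:
            self.counters[name] += value

    def _flush_loop(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                snapshot, self.counters = self.counters, defaultdict(float)
            # one UDP packet per metric per flush, instead of one per event
            for name, value in snapshot.items():
                self.sock.sendto(f"{name}:{value}|c".encode(), self.addr)

statsd = AggregatingStatsd()
for _ in range(100_000):
    statsd.incr("checkout.count")   # 100k events collapse into one packet per flush
time.sleep(11)                      # keep the process alive long enough for one flush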
What pull gets right — and where it gets uncomfortable
The big property pull buys you is "the collector defines truth". Your prometheus.yml (or kubernetes_sd or consul_sd) is the canonical list of what should be running. When a target stops responding, Prometheus knows immediately because its scrape failed; it converts that failure into the synthetic up metric, which is the foundation of every "service is down" alert in a Prometheus-native shop. You can write an alert that says up{job="payments-api"} == 0 for 30s and it will fire whenever any payments-api pod fails its scrape. There is no equivalent of up in pure-push systems — you have to maintain a registry of expected senders some other way (heartbeat metric, service mesh registry, K8s endpoints API). That liveness signal alone is why pull won the cloud-native era; the absence of it forced every pre-Prometheus monitoring system to bolt on a separate registry just to know what should exist.
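Because up is an ordinary time series, the fleet-wide liveness check is just a query. A small sketch, assuming a Prometheus server reachable at localhost:9090 and using its standard /api/v1/query HTTP API, that lists every target whose last scrape failed:
# down_targets.py: list targets whose most recent scrape failed
# (assumes a Prometheus server reachable at localhost:9090)
import requests

resp = requests.get("http://localhost:9090/api/v1/query",
                    params={"query": "up == 0"}, timeout=5)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    # each result is one target whose scrape failed, with its job/instance labels
    print(f"DOWN: job={labels.get('job')} instance={labels.get('instance')}")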
Pull also makes authentication and authorisation easy. The collector connects out to the target, so you put the auth boundary at the target — mTLS on the /metrics endpoint, scoped service tokens via Kubernetes ServiceAccount, network policy denying ingress except from the Prometheus pod's IP. The target controls who can scrape it; the producer never has to hold credentials for the metrics backend. In push systems the producer has to authenticate to the collector, and the credentials are distributed across thousands of pods, with all the rotation complexity that implies. Datadog and Honeycomb solve this with API-key-per-app and scoped tokens, but at the cost of pushing credential management into the producer, which is exactly the problem mTLS-on-pull avoids.
Where pull gets uncomfortable is short-lived jobs. A cron job that runs for 8 seconds at 02:00 will never be scraped — the next scrape interval (15s default) is more time than the job has. Prometheus's official answer is the Pushgateway, a long-lived process that accepts pushes from short-lived jobs and exposes them to be pulled. It is a band-aid that the Prometheus team explicitly markets as "for batch jobs only, do not use for service metrics" — and the reason is that the Pushgateway loses the up semantic (a stale gauge in the Pushgateway looks identical to a fresh one), which defeats the liveness model that pull was designed for.
The other discomfort is the scale of the target list. Prometheus holds the full target list in memory; at 100k targets the discovery refresh and scrape orchestration become a CPU drain on the Prometheus binary itself. The usual fix is to shard the target list across multiple Prometheus instances — hashmod relabelling on __address__, or the Prometheus Operator's shard support — with Thanos or Mimir stitching the shards back together at query time, but the architectural debt of "one collector knows everyone" only moves, it does not disappear. Push systems do not have this problem — adding a new producer doesn't change the collector's memory footprint until events actually arrive.
What push gets right — and where it gets uncomfortable
The case for push starts with short-lived jobs and event semantics. A Lambda function that runs for 800ms cannot be scraped — by the time Prometheus's discovery sees it, it's gone. A push client emits the metric inline (statsd.timing(...)), the UDP packet leaves the host, and the function exits. The metric arrives with the same fidelity as if a long-running process had emitted it. AWS CloudWatch Metrics works this way for exactly this reason — every Lambda invocation emits a Duration metric on exit. Push also fits bursty event-driven workloads — financial trade events, ad-bid responses, fraud-detection signals — where the event is the metric and there is no continuous time-series to sample.
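One concrete shape of this on AWS is the Embedded Metric Format: the "push" degenerates into a structured log line that the CloudWatch pipeline turns into a metric after the function has already exited. A hedged sketch follows; the envelope mirrors AWS's published EMF schema, while the namespace, dimensions, and the do_fraud_check helper are illustrative:
# emf_emit.py: sketch of push-on-exit metrics via CloudWatch Embedded Metric Format
# (the _aws envelope follows AWS's published EMF schema; names and dimensions are illustrative)
import json, time

def do_fraud_check(event):
    return 47.0   # hypothetical stand-in for the real check

def handler(event, context):
    latency_ms = do_fraud_check(event)
    # printing this JSON to stdout is the entire "push"; the CloudWatch log pipeline
    # turns it into a metric after the function has already exited
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "FraudCheck",
                "Dimensions": [["Region"]],
                "Metrics": [{"Name": "LatencyMs", "Unit": "Milliseconds"}],
            }],
        },
        "Region": "ap-south-1",
        "LatencyMs": latency_ms,
    }))
    return {"ok": True}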
Push gives you decoupling of producers and collector evolution. A Datadog Agent can be upgraded without restarting any application; the application code calls dogstatsd.timing(...) against a stable UDP API and the agent does whatever it wants behind that boundary. In a pull world, changing the metric format (say, OpenMetrics 2.0) requires either both sides to support it or a middleware translation layer.
Where push gets uncomfortable is liveness. The collector cannot distinguish "no events arrived because nothing is happening" from "no events arrived because the producer died at 03:14 and nobody noticed". Datadog and StatsD shops solve this by emitting a synthetic heartbeat counter (heartbeat.up = 1 every 60 seconds) and alerting on its absence — which is, of course, just rebuilding the up metric in user-space. The original push systems (Carbon, StatsD circa 2011) shipped without a heartbeat convention and the resulting "we lost a producer for 6 hours and didn't notice" outages are what motivated Prometheus's pull-first design choice.
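The heartbeat itself is a few lines. A sketch assuming a StatsD-style collector on the receiving end; the metric name, tag, and 60-second cadence are conventions you choose, not anything the protocol gives you:
# heartbeat.py: rebuild the `up` semantic in user space for a push pipeline
# (metric name, tag, and 60 s cadence are local conventions, not part of StatsD)
import socket, threading, time

def start_heartbeat(service, host="127.0.0.1", port=8125, period_s=60.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    def beat():
        while True:
            # gauge set to 1 every period; the alert fires on the *absence* of this series
            sock.sendto(f"heartbeat.up:1|g|#service:{service}".encode(), (host, port))
            time.sleep(period_s)
    threading.Thread(target=beat, daemon=True).start()

start_heartbeat("payments-api")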
Push also gets uncomfortable under bursty traffic at the collector. When 1,400 pods all push to a single StatsD endpoint at the same moment (every minute on the minute, because everyone synchronised their flush_interval), the collector's UDP buffer fills, the kernel drops packets, and the producers never know. The push-side answer is client-side aggregation (the Datadog Agent runs as a sidecar, aggregates per-host, then ships HTTP batches with retry to the central collector) which works but reintroduces all the operational complexity of running a pull-style collector — one per host instead of one per fleet.
The synchronised-burst problem — what synchronised scrape and synchronised flush actually look like
The single failure mode that catches every team off-guard the first time is the synchronised burst. Pull-side, this happens at second-tick boundaries when 1,400 pods all receive their /metrics GET at second :15 of the minute and their CPU spikes simultaneously while they serialise the metric text. Push-side, this happens at flush-interval boundaries when 1,400 StatsD clients all hit their 10-second flush_interval at the same moment and the central StatsD socket sees a wall of UDP packets. Both shapes are real. The fix in both worlds is jitter — randomise the per-target offset so the load spreads over the interval — but the fix lives in different places and the consequences of getting it wrong are different.
A simple measurement script that simulates 1,400 producers flushing on a 10-second cadence, with and without jitter, shows the shape:
# burst_simulator.py — measure synchronised vs jittered flush load
# pip install numpy
import numpy as np

N_PRODUCERS = 1400
INTERVAL_S = 10.0
WALL_S = 60.0
PACKETS_PER_FLUSH_RANGE = (40, 80)  # packets each producer emits per flush
rng = np.random.default_rng(7)

def simulate(jitter: bool):
    # each producer's first-flush offset (jittered or not)
    offsets = rng.uniform(0, INTERVAL_S, N_PRODUCERS) if jitter \
        else np.zeros(N_PRODUCERS)
    flush_times = []
    for i in range(N_PRODUCERS):
        t = offsets[i]
        while t < WALL_S:
            n_packets = rng.integers(*PACKETS_PER_FLUSH_RANGE)
            flush_times.extend([t] * int(n_packets))
            t += INTERVAL_S
    # bin into 100ms windows, compute peak
    bins = np.histogram(flush_times, bins=int(WALL_S * 10),
                        range=(0, WALL_S))[0]
    return int(bins.max()), int(bins.mean()), int(np.percentile(bins, 99))

peak_no, mean_no, p99_no = simulate(jitter=False)
peak_yes, mean_yes, p99_yes = simulate(jitter=True)
print(f"NO JITTER:   peak={peak_no:,} pkts/100ms  mean={mean_no:,}  p99={p99_no:,}")
print(f"WITH JITTER: peak={peak_yes:,} pkts/100ms  mean={mean_yes:,}  p99={p99_yes:,}")
print(f"reduction: {peak_no/peak_yes:.1f}x lower peak with jitter")
Representative output (exact figures move a little with the seed, but the shape does not):
NO JITTER:   peak ~84,000 pkts/100ms  mean ~830  p99 ~830
WITH JITTER: peak  ~1,500 pkts/100ms  mean ~830  p99 ~1,350
reduction: ~55x lower peak with jitter
The mean throughput is identical — same producers emitting the same packets, so both runs average roughly 830 packets per 100 ms bin. The peak is roughly 55× higher without jitter, and that peak is what overflows kernel UDP buffers, drops packets, and triggers netstat -su | grep "packet receive errors". The same script works as a model for pull-side load: replace "packets per flush" with "scrape body bytes" and you see a similar burst factor on the Prometheus pod's CPU and network interface.
Why the reduction is roughly 55× and not the full 100×: with no jitter, every producer fires at exactly t = 0, 10, 20, ..., so the entire fleet's flush — roughly 1,400 × 60 ≈ 84,000 packets — lands in a single 100 ms bin and the other 99 bins of the interval sit empty. With jitter uniformly distributed across the 10-second interval, each producer's flush lands in one of those 100 bins at random, so the expected per-bin load is 1/100 of the unjittered peak (~840 packets); the realised peak is closer to ~1,500 because of the binomial variance in how many producers happen to share the busiest bin. The reduction factor is therefore bounded above by the ratio of the interval (10 s) to the bin width (100 ms) and dragged below that bound by the variance floor. This is also why "shorter scrape intervals reduce burst" is a misleading statement — halving the interval halves the per-bin load only if jitter is enabled; without jitter, halving the interval doubles the burst frequency without changing its peak magnitude.
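A back-of-envelope check of those figures, using the same assumptions as burst_simulator.py (1,400 producers, roughly 60 packets per flush, a 10 s interval, 100 ms bins). The only statistics involved are how many producers land in the busiest bin:
# burst_back_of_envelope.py: sanity-check the simulator's peak figures analytically
import math

producers, pkts_per_flush = 1400, 60      # fleet size, average packets per flush
interval_s, bin_s = 10.0, 0.1             # flush interval, histogram bin width
bins_per_interval = interval_s / bin_s    # 100 bins per flush interval

unjittered_peak = producers * pkts_per_flush          # whole fleet in one bin: ~84,000
mean_per_bin = unjittered_peak / bins_per_interval    # ~840 with perfect spreading
# with jitter, producers per bin ~ Binomial(1400, 1/100); the busiest of ~100 bins
# sits a few standard deviations above the mean (rough 3-sigma heuristic)
std_producers = math.sqrt(producers * (1 / bins_per_interval) * (1 - 1 / bins_per_interval))
jittered_peak = (producers / bins_per_interval + 3 * std_producers) * pkts_per_flush
print(f"unjittered peak ~{unjittered_peak:,.0f}  jittered peak ~{jittered_peak:,.0f}  "
      f"reduction ~{unjittered_peak / jittered_peak:.0f}x")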
The jitter fix on Prometheus's pull side is automatic — the server offsets each target's scrape within the interval using a deterministic per-target hash, so the fleet's scrapes never line up without anyone configuring anything. On the push side, jitter is opt-in per client library, which means a single misconfigured service — hundreds of pods flushing in lockstep — can recreate a miniature version of the unjittered peak on the shared collector. This asymmetry — pull's jitter is centrally enforced, push's jitter is distributed — is one of the biggest operational reasons large fleets default to pull and treat push as the exception for short-lived workloads.
A practical corollary worth pulling out: the size of the burst is your collector's worst-case capacity requirement, not your average load. Provisioning a Prometheus pod for the mean scrape rate (say 14k samples/sec across the fleet) under-provisions the pod by a factor that equals your jitter ratio. If your scrape interval is 15 seconds and your CPU profile shows scrape-handler running for 800ms per 1400-target burst, you have to size for the 800ms-of-100% burst, not the 14.2-seconds-of-3% average. The same logic applies on the push side — a StatsD or OTel-collector pod has to be sized for the synchronised-flush burst, not the inter-burst trickle. Teams that miss this allocate cleanly-scaled-for-mean collectors, watch them work fine for weeks, then page on a Friday when one client library deployment shifts the jitter distribution and the burst peak pushes past the kernel buffer ceiling. The instrumentation that catches this early is node_netstat_Udp_RcvbufErrors (push side) and prometheus_target_scrapes_exceeded_sample_limit_total plus scrape_duration_seconds p99 (pull side); both are first-class metrics in their respective ecosystems and both point at exactly the same underlying problem when they spike.
The deeper observation: in both pull and push, the failure is the same shape — kernel buffer fills, packets or scrapes drop, and the symptom is missing data points. What differs is who can fix it. In pull, the Prometheus operator can globally re-jitter, scale horizontally, or reduce the target list — one config change, fleet-wide effect. In push, the producer fleet has to be re-deployed with new flush jitter, which can take days across hundreds of services owned by dozens of teams. The "who can fix it" axis is the operational dual of the "who initiates" axis, and it usually decides which model platform teams adopt for the long-running-service tier of their stack.
Common confusions
- "Push is asynchronous and pull is synchronous." Both are asynchronous from the producer's perspective. Pull's
/metricshandler is a cheap memory read; push'ssendto()is a cheap kernel syscall. Neither blocks the producer's main work. The real synchrony is at the collector — pull's scrape-time burst happens at synchronised 15-second boundaries across the fleet (which can cause its own thundering-herd problem); push's events arrive whenever they happen. - "Pull doesn't work for short-lived jobs." Pull doesn't work for short-lived jobs via direct scrape. Pull does work for short-lived jobs via the Pushgateway — which is a push collector that exposes a pull endpoint. The shape of the problem is "the data needs to outlive the job", not "the data needs to be pushed at the moment it's produced".
- "Push gives you exactly-once delivery." Pure UDP push (StatsD, Carbon) gives at-most-once and silently drops. TCP push (OTLP-gRPC) gives at-least-once with retries, which means the collector can see duplicates if a producer retries after a network blip. Neither is exactly-once — that's a property of the storage layer's deduplication, not the transport.
- "Prometheus's
upmetric proves pull is more reliable."upis a property of the protocol, not the network. A pull system tells you "I tried to scrape and failed"; a push system can tell you "I tried to send and failed" if it uses TCP and surfaces send errors. The OTLP-push exporter does exactly this — it returns errors fromExport()and the SDK handles retries. The semantic difference is real but it's about whose memory holds the failure flag, not which transport is more reliable. - "OpenTelemetry is push-only." OpenTelemetry's collector exposes both push receivers (OTLP-gRPC, OTLP-HTTP, statsd, fluentforward) and pull receivers (Prometheus scrape config, host metrics, kubelet stats). The OTel SDK can also be configured to expose a Prometheus endpoint instead of pushing — same SDK, different exporter. The framework deliberately doesn't pick.
- "Push wins because cloud functions can't expose ports." Some cloud functions can — Lambda's URL functions, Cloud Run, Knative all run a long enough HTTP server that scraping works. The constraint is execution-model: a cron-triggered Lambda that runs for 5 seconds cannot be reliably scraped at any cadence; an HTTP-triggered Cloud Run service that lives for hours absolutely can.
Going deeper
The up synthetic metric — the design choice that built an ecosystem
Prometheus emits a synthetic gauge up{job="<jobname>", instance="<addr>"} for every target in its config. The value is 1 if the most recent scrape succeeded, 0 if it failed (target unreachable, HTTP error, parse error, timeout). This single metric is the foundation of every "service is down" alert in the Prometheus-native ecosystem and the reason Prometheus alerts are typically more accurate than alerts in pre-pull-era systems. The mechanism is mundane — Prometheus already has to know whether a scrape succeeded in order to record its samples, so synthesising a metric from that signal costs nothing — but the consequence is large: a single alerting rule (expr: up == 0, for: 5m) covers the entire fleet, no per-service registration needed. In pure push systems, you have to maintain a registry of expected senders separately (Consul, K8s endpoints, a service mesh). The Datadog Agent solves this by being a sidecar (so its own host being up implies its targets are accessible), Honeycomb solves it via the Honeycomb-host heartbeat, and OpenTelemetry's collector emits its own otelcol_receiver_accepted_metric_points counter that you can alert on the absence of. Every push system in production today has reinvented up in a different namespace.
Why scrape-interval skew matters more than push burst absorption
If 1,400 pods all expose /metrics and Prometheus scrapes them all at exactly :15, :30, :45, :00, the network sees a 336 MB burst at second-tick boundaries and idle seconds in between. This is the synchronised-scrape problem, and the fix is built into Prometheus — each target's scrape is offset within the interval by a deterministic per-target hash, so the load is spread evenly without any flag or per-target configuration. Push systems have a symmetric problem: most StatsD clients flush at a configurable flush_interval (typically 10 s), and if the entire fleet boots at the same time the flushes synchronise. Etsy's original StatsD docs explicitly recommend jittering the flush; the Datadog Agent does this by default. The jitter trick is the same on both sides — the difference is who configures it. In pull, the collector's behaviour covers the entire fleet at once; in push, every producer's client library has to be configured correctly, and one misconfigured app flushing in lockstep is enough to overwhelm the central collector. Razorpay's 2023 platform-team postmortem on a StatsD outage was traced to one team that turned off jitter in their client config "to make graphs cleaner during testing" and forgot to re-enable it before production; on UPI peak-load Friday, that one service synchronised with the rest of the fleet and contributed a 90 MB burst on the same socket second.
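The producer-side fix is tiny, which is exactly why it gets lost in client-library defaults. A sketch of a flush loop with startup jitter; flush_metrics() is a hypothetical stand-in for whatever the client batches:
# jittered_flush.py: spread a fleet's flush times across the interval
# (flush_metrics() is a hypothetical stand-in for the client library's batched send)
import random, time

FLUSH_INTERVAL_S = 10.0

def flush_metrics():
    pass   # stand-in: serialise the in-process aggregate and send one packet

# a one-time random offset at startup de-synchronises the fleet permanently,
# because every later flush inherits that offset
time.sleep(random.uniform(0, FLUSH_INTERVAL_S))
while True:
    flush_metrics()
    time.sleep(FLUSH_INTERVAL_S)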
The Pushgateway anti-pattern — and why the docs recommend against it
The Prometheus Pushgateway is a long-lived HTTP server that accepts pushes from short-lived jobs (POST /metrics/job/foo) and exposes them to be pulled by Prometheus. It is the Prometheus team's official answer to "how do I monitor batch jobs", and the docs explicitly call out three anti-patterns to avoid. First, never push service-level metrics to it — the Pushgateway has no concept of "the producer died" because the metric value sits there until explicitly cleared, so a dead service looks the same as a healthy idle one. Second, never use it as a centralised metric ingest point for multiple services — it serialises all pushes and becomes a bottleneck above ~10k pushes/sec. Third, always push with a stable instance label so reruns of the same batch overwrite rather than accumulate, otherwise old runs leak into the gauge. The honest read of the Pushgateway is that it is a workaround for the one structural weakness of pull (short-lived jobs), and the workaround has its own structural weakness (no liveness signal) which you have to solve at the application layer. A 2024 Mimir-fleet postmortem at a Bengaluru fintech tracked a 4-hour metric staleness incident to a Pushgateway that had been receiving stale gauges for 11 days from a job that was renamed — every dashboard panel showed the old gauge value and nobody noticed until a customer complained that their settlement report was 11 days out of date.
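prometheus_client ships a helper for the sanctioned use case. A minimal sketch of a batch job pushing its result with a stable grouping key, so reruns overwrite the previous push rather than piling up next to it; the gateway address and metric names are illustrative:
# batch_push.py: push a batch job's result to a Pushgateway the way the docs intend
# (gateway address, job name, and metric names are illustrative)
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge("settlement_batch_last_success_seconds",
                     "Unixtime of the last successful settlement run", registry=registry)
rows = Gauge("settlement_batch_rows_processed", "Rows processed by the last run",
             registry=registry)

rows.set(182_334)
last_success.set_to_current_time()
# stable grouping key: a rerun replaces the previous push instead of accumulating beside it
push_to_gateway("pushgateway.monitoring:9091", job="settlement_batch",
                grouping_key={"instance": "settlement-batch"}, registry=registry)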
OTLP-push as the new middle ground
OpenTelemetry's metrics SDK with the OTLPMetricExporter is what most cloud-native shops are converging on for the push side, and it deliberately blunts every traditional push weakness. Transport: TCP via gRPC or HTTP/protobuf, not UDP — so the producer sees send failures and can retry. Batching: the SDK aggregates locally for export_interval_millis (default 60 s) before pushing, so per-event overhead is amortised across thousands of events per batch — close to pull's per-scrape efficiency. Liveness: the OTel collector emits its own otelcol_receiver_accepted_metric_points and otelcol_receiver_refused_metric_points counters, which give you a per-receiver heartbeat. Backpressure: gRPC's flow-control surfaces "the collector can't keep up" as RESOURCE_EXHAUSTED errors at the producer, who can then drop, queue, or sample locally — none of which a UDP-push system can do. The result is a push protocol that recovers most of pull's robustness while keeping push's flexibility for short-lived jobs and event-driven workloads. It is not free — running an OTel collector per region adds operational overhead — but for fleets that span both long-running services and Lambda-shaped workloads, it is the only design that handles both without two parallel pipelines.
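What that looks like in the Python SDK, as a sketch: it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages, and the collector endpoint is a placeholder:
# otlp_push.py: batched OTLP push from the OpenTelemetry Python SDK
# (assumes opentelemetry-sdk + opentelemetry-exporter-otlp; endpoint is a placeholder)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# the reader aggregates in-process and exports one batch per interval over gRPC,
# so send failures are visible and retryable, unlike fire-and-forget UDP
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector.monitoring:4317", insecure=True),
    export_interval_millis=60_000,
)
meter = MeterProvider(metric_readers=[reader]).get_meter("checkout")

checkouts = meter.create_counter("checkout_total")
latency = meter.create_histogram("checkout_latency_ms", unit="ms")

checkouts.add(1, {"region": "ap-south-1a"})
latency.record(47.0, {"region": "ap-south-1a"})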
The hybrid in practice — what every large Indian shop actually runs
No real production fleet runs pure push or pure pull. Razorpay's payment-platform fleet pulls from long-running services (Prometheus scraping every 15 s, 8M active series), pushes from Lambda-shaped jobs (CloudWatch Embedded Metric Format, batched), and uses OTLP-push for distributed tracing (Tempo via OTel collector). Hotstar's video-delivery fleet pulls from origin servers and edge proxies (Mimir scraping a long target list) and pushes from per-stream session metrics (a custom UDP push to a Cassandra-backed time-series store, because the cardinality is too high for Prometheus). Zerodha's order-router fleet pulls almost everything but pushes order-event-rate metrics through Kafka-as-event-bus so the metric path doubles as the analytics path. The lesson is not "pick one". The lesson is "pick per-workload" — and to pick correctly you have to know which failure mode each workload faces hardest. A long-running stateless API handler faces "did this pod just die" first → pull. A Lambda fraud-check fires once per UPI transaction and exits → push. An exotic hot loop emits 50 KHz events per process and the collector has to keep up → batched push (or a side-car aggregator that pulls from the process and pushes upstream). The real-world architecture is layered, and the push-vs-pull choice happens at every layer independently.
Where this leads next
- Prometheus TSDB internals — the storage engine that pull's scrape output lands in. Understanding how Prometheus stores 1.3 bytes per sample is what makes the once-per-15-second scrape model financially defensible.
- Cardinality: the master variable — both push and pull blow up the same way when label cardinality explodes. The collection model doesn't change the cardinality budget, only where the explosion shows up first.
- Long-term storage: Thanos, Cortex, Mimir — these systems all assume a pull-shaped scrape output as their input format. Push-collected metrics arrive at Mimir via the same
/api/v1/push endpoint Prometheus uses for remote-write, which is itself a push protocol layered on top of pull's data shape.
The collection-model decision is rarely the dominant cost in an observability stack — cardinality, retention, and query throughput are bigger budget items — but it is the choice you cannot easily reverse later. A fleet that started pull-first has its alerts, dashboards, and SLO definitions wired around the up metric and the rate() function semantic; switching to push means rebuilding all of those. A fleet that started push-first has its credentials distributed and its short-lived-job pipeline already wired; switching to pull means restructuring discovery and adding /metrics endpoints to every service. Most fleets that try to switch end up running both for years and paying the operational cost of two pipelines.
The sharpest framing of the choice — sharper than "which is more robust" or "which scales better" — is which liveness model do you want. Pull says "the collector defines truth and tells you when truth is missing". Push says "the producer defines truth and the consumer trusts what arrives". Both work; both have failure modes; the failure modes are visible in different dashboard panels and different alert routes. Once you know which failure mode you would rather catch a 03:00 page about, the rest of the design follows.
References
- Brian Brazil, "Pull doesn't scale — or does it?" (Robust Perception, 2016) — the canonical defence of pull from one of Prometheus's earliest core contributors. Makes the liveness-detection argument more crisply than any other source.
- Etsy, "Measure Anything, Measure Everything" (2011) — the StatsD origin post. Push-first design choice and rationale.
- Prometheus Pushgateway documentation, "When to use the Pushgateway" — explicitly enumerates the anti-patterns and why the docs themselves recommend against using it for service metrics.
- OpenTelemetry Specification — Metrics SDK — the OTLP-push protocol and aggregation-temporality choice that underlies the modern push-with-batching model.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — frames push-vs-pull as a question of liveness ownership rather than transport choice.
- Datadog Agent architecture (engineering blog) — a real production push-with-batching architecture and the engineering trade-offs that shape it.
- Gorilla compression: the key insight — the storage-side reason that pull's once-per-15-second scrape model is affordable.
- Prometheus TSDB internals — what the scrape output lands in and how the data model assumes pull-shaped inputs.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests numpy
python3 push_vs_pull.py   # then: python3 burst_simulator.py for the jitter experiment
# Expected: pull per-event ~1-5us, push per-event ~15-30us;
# pull scrape ~5KB/15s; push ~50KB/1000 events; some UDP loss under load.
# To see the burst-vs-sustained difference, push 100k events and watch
# the kernel UDP-receive-error counter: netstat -su | grep "packet receive errors"