Agentless observability claims
Aditi runs platform engineering at a Mumbai broker that processes ₹4,200 crore of equity trades on a typical Tuesday. The vendor on the procurement call has spent fifteen minutes explaining their agentless observability platform — "no per-service instrumentation, no SDK adoption, no sidecars, your developers do nothing, you just install us and the traces appear". She likes the pitch; her team has spent six months trying to roll OpenTelemetry SDKs across 90 microservices and is at 41% coverage. Then she asks: "If it's agentless, what runs on the node?" The answer is a thirty-second pause, then "a small eBPF probe — but it's not really an agent, it's a kernel-level collector". She mutes herself, opens a tab to the vendor's docs, finds the install command (helm install ... --set privileged=true ...), and recognises the shape immediately. This is a DaemonSet. It runs CAP_SYS_ADMIN. It mounts /sys/kernel/debug. It is exactly an agent — they have just renamed it.
"Agentless observability" almost never means zero agents — it means a single privileged eBPF DaemonSet that watches every process on the node, replacing N per-service SDKs with one shared collector. The trade-off is real: you stop coordinating with app teams, but you take on a privileged kernel-tap, lose application-level context (business labels, MDC fields, user-defined attributes), and shift the security review from N codebases to one DaemonSet running with CAP_SYS_ADMIN. This chapter shows what is honestly happening in the kernel, in your YAML, and in your threat model when a vendor says the word.
What "agentless" honestly means in 2026
The word agentless came from a different era of monitoring. In the SNMP / WMI world of the 2000s, "agentless" meant the monitoring system polled the target over a network — no software was installed on the monitored host. SolarWinds polling a Cisco switch over SNMP is genuinely agentless. CloudWatch reading EC2 metadata via the AWS API is agentless. Probing an HTTPS endpoint and timing the TLS handshake is agentless. None of these run code inside the system being observed.
The eBPF generation borrowed the word and silently changed its meaning. When a 2026-era vendor — Datadog Universal Service Monitoring, Dynatrace OneAgent, New Relic Pixie, Grafana Beyla, Cilium Tetragon, Splunk Observability Cloud — says agentless, they mean one shared agent per node instead of one agent per service. The shape is: a DaemonSet pod runs on every Kubernetes node; that pod attaches eBPF probes to syscalls, kprobes, uprobes, and tracepoints across the whole kernel; every process on the node is observed by the same eBPF program. From the application team's point of view there is no agent — they ship their service unchanged, no SDK, no sidecar, nothing in their Dockerfile. From the platform team's point of view there absolutely is an agent; it just happens to be theirs.
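The shape is easy to demonstrate. The sketch below is a hypothetical standalone script using the BCC Python bindings (not any vendor's agent); it attaches a single kprobe to tcp_sendmsg and immediately sees traffic from every process on the node, with no change to any application:
# one_probe_whole_node.py -- a minimal sketch of the "one shared collector" shape.
# Needs the BCC Python bindings (python3-bpfcc on Debian/Ubuntu) and root, or CAP_BPF + CAP_PERFMON.
# Illustration only: this is not any vendor's code.
from bcc import BPF

# A single kprobe on tcp_sendmsg fires for every process on the node that sends TCP
# data; the application containers never know they are being observed.
prog = r"""
#include <uapi/linux/ptrace.h>
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("tcp_sendmsg pid=%d comm=%s\n", pid, comm);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")
print("tracing tcp_sendmsg node-wide... Ctrl-C to stop")
b.trace_print()  # streams events from the kernel trace pipe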
Why the rebrand mattered enough to invent the word: enterprise procurement, especially in Indian banks and regulated fintechs (HDFC, ICICI, Razorpay's RBI-supervised flows), has institutional resistance to "installing an agent on the production host". A 2019 InfoSec policy that says "no third-party agents in container images" was written when "agent" meant "Datadog Agent linked into the JVM with -javaagent:dd-java-agent.jar". The eBPF-DaemonSet shape technically sidesteps that policy — nothing is linked into the application — but the same security concern applies: the DaemonSet runs with broader kernel-level privileges than any per-service agent ever did. The marketing word agentless exists precisely to navigate that institutional friction; the engineering substance is "we replaced N normal agents with one privileged super-agent". Whether that trade is good for you depends on what you actually need to monitor and how your security team feels about CAP_SYS_ADMIN.
What you gain, what you lose: the honest trade matrix
The agentless shape has real wins. It also has real losses. The vendor pitch deck only lists the wins. Here is the matrix as a platform engineer would write it after running both shapes in production.
You gain: zero per-service code change, instant 100% fleet coverage on day one, no SDK version coordination across 90 services, no per-language instrumentation library to maintain, no OpenTelemetry-Python==1.21 vs 1.23 skew breaking your trace exporter mid-deploy, no per-team rollout plan, no "this team is on Java 8 and we can't update their bytecode agent", no sidecar resource overhead replicated N times. For a platform team trying to roll observability across an organisation of 30+ engineering squads, the coordination win is the largest one — the political cost of asking every team to add an SDK dependency is what kills most observability programs in the first place.
You lose: application-level context. The eBPF probe sees syscalls (sys_write, sys_read, connect, accept), it sees TCP-level events (HTTP/1 request lines on port 8080, gRPC frames if it can parse them), it sees TLS handshakes (but not decrypted bytes unless the vendor implements uprobe hooks into OpenSSL's SSL_write symbol — which Datadog and Pixie do, with caveats). It does not see your business attributes — customer_tier="vip", merchant_category="fuel", loan_amount_inr=85000, kyc_status="verified" — because those live in your application's variables, never crossing the syscall boundary. An OTel SDK in-process can attach those as span attributes; an eBPF agent observing from the kernel cannot. The richer the question your team wants to ask of the trace data ("which VIP merchants saw p99 > 500ms during the IPL final?"), the worse the agentless model performs.
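For contrast, here is roughly what the in-process SDK path looks like. A minimal sketch using the OpenTelemetry Python SDK with a console exporter; the span and attribute names are illustrative, taken from the examples above:
# business_attributes_sdk.py -- what an in-process SDK can attach that a kernel-side probe never sees.
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("payments")

def charge(merchant_id: str, amount_inr: int, tier: str) -> None:
    # These values live only in application variables; they never cross a syscall
    # boundary, so an agent watching tcp_sendmsg cannot recover them.
    with tracer.start_as_current_span("upi.mandate.charge") as span:
        span.set_attribute("merchant_tier", tier)
        span.set_attribute("loan_amount_inr", amount_inr)
        span.set_attribute("merchant_id", merchant_id)
        # ... actual charge logic here ...

charge("m_482", 85000, "vip")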
You also lose decrypted HTTPS payload visibility in the simple case. eBPF can see TLS frames going by but cannot decrypt them — the encryption key lives in user-space inside OpenSSL or BoringSSL or Go's native TLS. Vendors solve this with uprobe-on-SSL_write (intercept the plaintext just before encryption), but that requires the binary to be unstripped and the symbol to be findable. Static-linked Go binaries built with -trimpath -ldflags=-s -w (the default for many Indian fintech container images, because they want smaller artefacts) are notoriously hard for uprobe-based unwinders to introspect. Your "agentless" tool shows you a TCP connection to a backend; it does not show you what was in the JSON.
Why the uprobe-on-SSL_write story is a fragile foundation: when the agent attaches a uprobe to SSL_write in libssl.so, it is asking the kernel to fire its eBPF program every time any process on the node calls that symbol with the cleartext buffer. This works beautifully on a Python/Flask service that links libssl.so.1.1 dynamically — Razorpay's older Python services, for instance, are exactly this shape. It works poorly on Go services that compile their own TLS stack into the binary (Go's crypto/tls does not call libssl), where the agent must instead uprobe Go-runtime symbols like crypto/tls.(*Conn).Write — and those symbols vanish when you build with -ldflags=-s -w. It works not at all on a Rust service using rustls (which bundles a different TLS stack), or on a Java service using JSSE (the JVM's native TLS). The honest position: uprobe-on-SSL_write covers maybe 60% of a typical 2026 fleet, and the missing 40% is precisely the fleet's modern, performance-tuned services. When a vendor demos decrypted-L7 traces, ask them to demo on a Go binary built with the same flags as your production image — that is the test that filters serious tools from marketing.
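Before that vendor call, you can run the triage yourself. The sketch below is a rough, assumption-laden check (it only inspects dynamic linkage and symbol tables with ldd and nm, a small subset of what a real uprobe-based agent does) that you can point at your actual production binary:
# uprobe_viability_check.py -- rough triage of whether an SSL_write-style uprobe has
# anything to attach to in a given binary. A sketch, not a vendor tool.
# Usage (inside the container image): python3 uprobe_viability_check.py /app/server
import subprocess, sys

binary = sys.argv[1]

def run(*cmd: str) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# 1. Dynamically linked against libssl? (the easy case for SSL_write uprobes)
links_libssl = "libssl" in run("ldd", binary)

# 2. Any symbols left at all? (stripped binaries defeat symbol-based uprobes)
symtab = run("nm", binary)
stripped = symtab.strip() == "" or "no symbols" in symtab

# 3. Go binary with its own TLS stack? (crypto/tls never calls libssl)
go_tls_symbols = [line for line in symtab.splitlines() if "crypto/tls" in line]

print(f"dynamically linked against libssl : {links_libssl}")
print(f"symbol table stripped             : {stripped}")
print(f"go crypto/tls symbols found       : {len(go_tls_symbols)}")
if links_libssl and not stripped:
    print("=> SSL_write-style uprobe plausibly works (OpenSSL path)")
elif go_tls_symbols:
    print("=> agent must uprobe Go runtime symbols; verify with your exact build flags")
else:
    print("=> expect connection metadata only; ask the vendor to demo on this binary")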
What "agentless" looks like in your YAML
Marketing aside, the procurement reality is that you read the vendor's Helm chart and decide whether you want what it asks for. Here is a representative snippet of what an eBPF-based "agentless" tool's DaemonSet looks like — paraphrased from the published Helm charts of Datadog Universal Service Monitoring, New Relic Pixie, and Grafana Beyla as of late 2025. The exact field names differ per vendor; the shape does not.
# inspect_agentless_daemonset.py — parse a vendor Helm-rendered manifest, audit privileges
# pip install pyyaml
import sys
import yaml

# Render the vendor's chart locally: helm template <vendor>/observability-agent > rendered.yaml
RENDERED = sys.argv[1] if len(sys.argv) > 1 else "rendered.yaml"

with open(RENDERED) as f:
    docs = list(yaml.safe_load_all(f))

ds = [d for d in docs if d and d.get("kind") == "DaemonSet"]
print(f"DaemonSets in chart: {len(ds)}")

PRIV_FIELDS = ["privileged", "hostPID", "hostNetwork", "hostIPC", "allowPrivilegeEscalation"]
SENSITIVE_MOUNTS = ["/sys", "/proc", "/var/run/docker.sock", "/var/lib/kubelet"]

for d in ds:
    name = d["metadata"]["name"]
    spec = d["spec"]["template"]["spec"]
    print(f"\n=== DaemonSet: {name} ===")
    # Pod-level flags: hostPID / hostNetwork / hostIPC share the host's namespaces
    for field in PRIV_FIELDS:
        if field in spec:
            print(f"  pod-spec.{field:<25} = {spec[field]}")
    # Container-level securityContext, init containers included
    for c in spec.get("initContainers", []) + spec.get("containers", []):
        sec = c.get("securityContext", {}) or {}
        caps_add = (sec.get("capabilities") or {}).get("add") or []
        print(f"  container[{c['name']}].privileged = {sec.get('privileged', False)}")
        print(f"  container[{c['name']}].capabilities.add = {caps_add}")
        print(f"  container[{c['name']}].runAsUser = {sec.get('runAsUser', 'unset')}")
    # Host-path mounts that expose kernel or container-runtime interfaces
    mounts = []
    for c in spec.get("containers", []):
        for m in c.get("volumeMounts", []):
            mounts.append(m["mountPath"])
    sensitive_hits = [m for m in mounts if any(s in m for s in SENSITIVE_MOUNTS)]
    print(f"  sensitive host mounts = {sensitive_hits}")
Sample run on a representative vendor's rendered chart:
DaemonSets in chart: 1
=== DaemonSet: vendor-observability-agent ===
pod-spec.hostPID = True
pod-spec.hostNetwork = True
pod-spec.hostIPC = False
container[agent].privileged = True
container[agent].capabilities.add = ['SYS_ADMIN', 'BPF', 'PERFMON', 'NET_ADMIN', 'IPC_LOCK']
container[agent].runAsUser = 0
container[init-bpf-fs].privileged = True
container[init-bpf-fs].capabilities.add = ['SYS_ADMIN']
container[init-bpf-fs].runAsUser = 0
sensitive host mounts = ['/sys/kernel/debug', '/sys/fs/bpf', '/proc', '/var/run/docker.sock']
Walk through what that output is telling you. The hostPID: True line means the agent's process namespace is shared with the host; the agent's /proc shows every process on the node, not just its own pod. This is required for uprobe-based instrumentation — the agent needs to read the executable file and resolve symbols for every binary on the node — but it also means the agent can kill -9 any other process. The capabilities.add: ['SYS_ADMIN', 'BPF', 'PERFMON', ...] line is the privilege escalation: CAP_SYS_ADMIN is the kernel "give me everything" capability, including the ability to load BPF programs, attach to any kprobe, and read arbitrary kernel memory. The CAP_BPF and CAP_PERFMON capabilities exist in modern kernels (5.8+) precisely to split SYS_ADMIN so eBPF tools don't need full root — vendors often request both SYS_ADMIN and BPF for backward compatibility, even though BPF + PERFMON would be sufficient. The /sys/kernel/debug mount is what lets the agent read tracefs (the kernel's tracepoint and kprobe interface). The /var/run/docker.sock mount — present in some vendors' charts for "container metadata enrichment" — is the worst single line in the manifest; it gives the agent the ability to start any container on the node with any privileges. Some vendors have removed it in favour of the Kubernetes API; others have not.
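If you want to make the hostPID point concrete for your security team, the sketch below, run from a pod deployed with hostPID: true (or directly on the node), counts how many other processes' environment blocks are readable; it prints no values, and the list of secret-looking variable names is illustrative:
# hostpid_exposure_demo.py -- what hostPID plus root actually exposes. Illustration only.
import os

SECRETISH = {b"AWS_SECRET_ACCESS_KEY", b"DATABASE_URL", b"DB_PASSWORD", b"API_KEY"}
visible, readable = 0, 0

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    visible += 1
    try:
        with open(f"/proc/{entry}/environ", "rb") as f:
            env_vars = f.read().split(b"\x00")
    except (PermissionError, FileNotFoundError, ProcessLookupError):
        continue
    readable += 1
    # Count variables whose names look secret-ish; deliberately do not print values.
    hits = sum(1 for v in env_vars if v.split(b"=")[0].upper() in SECRETISH)
    if hits:
        print(f"pid {entry}: {hits} secret-looking env var(s) readable")

print(f"processes visible via /proc: {visible}, environ readable: {readable}")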
Why the security review is harder than the SDK story it replaces: in the SDK world, every team's requirements.txt lists opentelemetry-sdk==1.21.0, and a CVE in the OTel exporter affects only the services that pulled it in. A clever attacker who compromises the OTel SDK in the checkout service can exfiltrate traces from checkout — bad, but bounded. In the agentless DaemonSet world, a CVE in the eBPF agent (and there have been several — CVE-2022-23222 in the BPF verifier, CVE-2023-2156 in the netfilter eBPF helper) compromises one privileged process per node, which can read memory from every other process on that node, including secrets in /proc/<pid>/environ and decrypted database connections. The blast radius of an agentless-collector compromise is the entire node; the blast radius of an SDK compromise is one service. This is not a theoretical concern — it is why every Indian regulated fintech that has deployed eBPF observability runs the DaemonSet in a separate VPC, behind admission-controller policies, with strict image-pinning. The vendors who handle this conversation honestly in their docs (Cilium publishes their threat model; Grafana Beyla publishes a hardened-deployment guide) earn the deals.
What the agent actually does at runtime — a 60-second walkthrough
Strip away the marketing and watch a typical eBPF observability DaemonSet boot on a Linux 6.5 node. The sequence below is what a bpftool prog show snapshot would print 60 seconds after the pod starts on a node hosting three application containers (a Flask app, a Go gateway, a Postgres sidecar).
# bpftool prog show — taken 60s after agentless-collector pod start
17: kprobe      name tcp_sendmsg_e  tag a4f1b6...  loaded_at 2026-04-25T11:20:14
        btf_id 412  jited 1  size 1840  attached_to tcp_sendmsg
18: kprobe      name tcp_recvmsg_x  tag c8e2d3...  loaded_at 2026-04-25T11:20:14
        btf_id 412  jited 1  size 2104  attached_to tcp_recvmsg
22: kprobe      name sched_switch   tag 7b3a09...  loaded_at 2026-04-25T11:20:15
        btf_id 412  jited 1  size 968   attached_to __schedule
24: tracepoint  name sys_connect    tag 4d2f81...  attached_to syscalls/sys_enter_connect
27: uprobe      name ssl_write_e    tag 9e7c14...  attached_to /usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write
28: uprobe      name ssl_write_r    tag 9e7c14...  attached_to /usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write+ret
30: perf_event  name profile_hz99   tag 1a5b2e...  attached_to perf:cpu-clock@99Hz

# bpftool map show
33: hash     name flow_state      flags 0x0  max_entries 65536    memlock 5.2MB
34: hash     name conn_meta       flags 0x0  max_entries 8192     memlock 1.1MB
38: ringbuf  name events_to_user  flags 0x0  max_entries 8388608  memlock 8.0MB
That snapshot is the agent's actual work: 7 BPF programs attached across kprobes, tracepoints, uprobes, and perf events; 3 BPF maps (two hash tables for connection state, one ring buffer for shipping events to userspace); roughly 14MB of kernel memory locked. The agent's userspace process polls the ring buffer, batches events, and exports them via OTLP/gRPC to the vendor backend. Everything the dashboard later shows you — service-graph edges, p99 per-endpoint, decrypted HTTP request lines — is reconstructed from this stream.
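You can take the same inventory on your own nodes. Below is a sketch that summarises bpftool's JSON output; the field names (type, bytes_memlock) match recent bpftool releases but may differ on older ones, so verify against bpftool prog show -j on your kernel:
# bpf_footprint_summary.py -- summarise what an eBPF agent has loaded into the kernel.
# Needs root on the node and a bpftool new enough to emit JSON.
import json, subprocess
from collections import Counter

def bpftool(*args: str) -> list:
    out = subprocess.run(["bpftool", "-j", *args], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

progs = bpftool("prog", "show")
maps = bpftool("map", "show")

print(f"BPF programs loaded: {len(progs)}")
for prog_type, count in Counter(p.get("type", "?") for p in progs).most_common():
    print(f"  {prog_type:<12} {count}")

# bytes_memlock is the locked kernel memory each object pins (field name per recent bpftool)
locked = sum(o.get("bytes_memlock", 0) for o in progs + maps)
print(f"BPF maps: {len(maps)}, locked kernel memory: {locked / 1024 / 1024:.1f} MB")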
# inspect_ringbuf_throughput.py — measure how much an "agentless" agent ships per second
# pip install psutil
import time
import psutil

# Find the agent pod's container PID on the node
def find_agent_pid(name_substr: str = "observability-agent") -> int:
    for p in psutil.process_iter(["pid", "name", "cmdline"]):
        cmd = " ".join(p.info["cmdline"] or [])
        if name_substr in cmd:
            return p.info["pid"]
    raise RuntimeError(f"no process matching {name_substr}")

pid = find_agent_pid()
print(f"agent PID: {pid}")

# Sample /proc/<pid>/io. wchar/rchar count bytes passed to write()/read() syscalls,
# sockets included — a rough proxy for egress to the vendor backend; write_bytes/read_bytes
# count block I/O only and would miss the network entirely.
def io_snap(pid: int) -> dict:
    with open(f"/proc/{pid}/io") as f:
        return dict(line.strip().split(": ", 1) for line in f if ": " in line)

a = io_snap(pid)
time.sleep(10)
b = io_snap(pid)
dt = 10.0
write_bps = (int(b["wchar"]) - int(a["wchar"])) / dt
read_bps = (int(b["rchar"]) - int(a["rchar"])) / dt

# CPU and memory usage in the same window
cpu_pct = psutil.Process(pid).cpu_percent(interval=1.0)
mem_mb = psutil.Process(pid).memory_info().rss / 1024 / 1024

print(f"agent egress (to vendor backend): {write_bps/1024:8.1f} KB/s")
print(f"agent ingress (from kernel ring): {read_bps/1024:8.1f} KB/s")
print(f"agent CPU pct on this node: {cpu_pct:5.1f}%")
print(f"agent RSS: {mem_mb:5.1f} MB")
Sample run on a 4-vCPU node hosting a moderate-traffic Flask + Go pair (≈ 2,000 RPS combined):
agent PID: 14823
agent egress (to vendor backend): 412.6 KB/s
agent ingress (from kernel ring): 58.1 KB/s
agent CPU pct on this node: 1.7%
agent RSS: 187.3 MB
The 1.7% CPU and 187 MB RSS are the floor of an "agentless" rollout. Multiply by your node count: a 200-node fleet pays roughly 3.4 cores and 37GB of RAM that does not show up in any application team's quota — it shows up in the platform team's. This is the cost-shift the marketing buries: app teams stop carrying SDK overhead per-pod (typically 0.3-0.8% per service), the platform team starts carrying DaemonSet overhead per-node. The total cluster overhead is usually similar; the org-chart line where the cost lands is different.
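The arithmetic is worth writing down explicitly, because the per-node number looks small until it is multiplied out. A trivial sketch using the sample-run numbers above; the node count is an assumption to replace with your own:
# fleet_overhead_math.py -- the multiply-by-node-count arithmetic from the paragraph above.
NODES = 200                               # assumption: replace with your fleet size
AGENT_CPU_FRACTION_OF_ONE_CORE = 0.017    # 1.7% as psutil reports it (share of one core)
AGENT_RSS_MB = 187.3                      # from the sample run above

total_cores = NODES * AGENT_CPU_FRACTION_OF_ONE_CORE
total_ram_gb = NODES * AGENT_RSS_MB / 1024
print(f"{NODES}-node fleet: {total_cores:.1f} cores, {total_ram_gb:.0f} GB RAM "
      f"of DaemonSet overhead, all landing in the platform team's quota")
# -> 200-node fleet: 3.4 cores, 37 GB RAM of DaemonSet overhead, ...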
Real Indian production stories — when "agentless" was right and when it was wrong
The shape of the trade-off depends on what you are trying to monitor and what your team actually needs. Three Indian production cases illustrate.
Hotstar IPL 2024 — agentless was right. Hotstar's streaming infrastructure spans 80+ microservices, 4 different language runtimes (Go, Java, Node, C++), and 30+ deployment teams. During the 2024 IPL, the platform observability team wanted L4/L7 visibility — which service was talking to which, what protocols were on the wire, where the connection-pool exhaustion was — across the whole fleet without coordinating with 30 teams during a launch window. They deployed Cilium Hubble and a vendor eBPF tracer side by side; both gave them HTTP request rates, p99 latencies, and DNS query volumes within hours of install. The platform team did not need business attributes (tournament_id, match_id, user_tier) because the existing OTel SDK rollout already had those for the booking and payments paths. Two-tool hybrid: eBPF for the L4/L7 layer, SDK for the business layer. The agentless tool replaced what would have been 6 months of per-team SDK rollouts for fleet-wide protocol visibility. They paid for it; it earned its keep.
Razorpay merchant-tier debugging, 2023 — agentless was wrong. Razorpay's payments team needed to debug a performance regression that was specific to Tier-1 merchants on the UPI Mandate path. The eBPF agent showed them: HTTP requests to the payments-service from the gateway, with average latency. It did not show them: which requests came from Tier-1 merchants vs Tier-2, which were UPI Mandate vs UPI Collect, which had the new fraud-flag enabled. All of those attributes were Python variables inside payment_handler.py, never crossing the syscall boundary. The team spent a week trying to extract tier signals from URL patterns and HTTP headers (the merchant tier was set in a header, but only on internal calls, not on calls from the gateway), and eventually rolled the OTel SDK on the payments path to get the right span attributes. The eBPF tool was useful for confirming "the regression is on the UPI Mandate endpoint", but the diagnosis required SDK-level fidelity. The lesson: when your debugging questions are framed in your business vocabulary, eBPF observability cannot answer them.
Cred reconciliation pipeline, 2025 — agentless was a security blocker. Cred's compliance team blocked the rollout of an eBPF-based "agentless" observability tool because the vendor's DaemonSet asked for CAP_SYS_ADMIN, hostPID: true, and /var/run/docker.sock mount, and the same node also ran the rewards-card-tokenisation service, which holds RBI-restricted data. The compliance team's argument was simple: a privileged DaemonSet on the tokenisation node is itself a tokenisation-data exposure path. The platform team negotiated a deployment where the agent ran only on non-tokenisation nodes; the agent's coverage of the fleet dropped to 60%; the procurement was downgraded from a full-fleet contract to a partial-fleet contract; the vendor's "100% coverage in 1 day" pitch was technically true but commercially halved. This is the failure mode platform engineers should rehearse before a procurement call: the more privileged the agent, the more often security review will carve out exactly the nodes you most wanted to monitor.
The pattern across all three: the technical shape of the agentless trade-off is real and measurable; the organisational shape (who is your security team, who owns the regulated nodes, who has SDK rollout fatigue) determines whether the trade is good for you. The vendor cannot answer this question; only your platform team can.
How to evaluate a vendor's agentless claim — a checklist
Take this list into your next vendor call. The order matters; questions early in the list rule out vendors faster.
- "What runs on the node?" Make them say "DaemonSet" or "host process" or "init-script". If they say "nothing", they are either a SaaS-API integration (like Cloudflare Logpush) or they are not telling the truth.
- "What capabilities does it request?"
CAP_SYS_ADMINis the worst answer;CAP_BPF + CAP_PERFMON + CAP_NET_ADMINis acceptable;none + a non-privileged eBPF API likebpf_unprivileged_disabled=0`` is rare and excellent. - "What kernel versions are supported?" Most eBPF-based tools require kernel ≥4.18 for
CO-RE(Compile Once, Run Everywhere); kernel ≥5.8 for split capabilities; kernel ≥6.0 for the modernkfuncinterface. If your fleet runs CentOS 7 (kernel 3.10), you cannot use any of these tools, period. - "What does it see on TLS-encrypted traffic?" "Connection metadata only" is the honest answer; "decrypted L7 with
uprobeonSSL_write" is more powerful but only works on dynamically-linked, unstripped binaries. Static-linked stripped Go is the case where most claims fall apart — ask them to demo on your actual binary. - "What happens to my business attributes?" If the answer is "we infer them from URL paths and HTTP headers", confirm whether your team's actual labels (
merchant_tier,kyc_status,loan_amount_inr) cross the wire as headers. If they live in process memory only, the vendor cannot see them. - "What is the blast radius if your agent crashes or has a CVE?" Honest answer: "loss of observability across the whole node, plus a privileged-process compromise vector". Dishonest answer: "we have never had a CVE".
- "How do you handle Pod restart? Does the agent need to restart?" eBPF programs survive process restart (they live in the kernel); the agent process must restart to re-attach. The right answer is "the eBPF programs persist via a pinned BPF filesystem; agent restarts re-attach in <2s with no observability gap". The wrong answer is "we lose data for 30s during agent restart".
- "What is your
oom_score_adj?" Privileged DaemonSets that run withoom_score_adj: 0(default) get OOM-killed first when the node is under memory pressure — exactly when you most want observability. Production-grade tools setoom_score_adj: -1000or use a guaranteedResources.requests. Ask for the YAML. - "Can I run your agent in a non-
hostPIDmode for sensitive nodes?" A "yes" with reduced functionality is the honest answer; "no, we require hostPID" tells you the design is fragile to security-review carve-outs. - "Show me the part of your docs that says what
agentlessactually means in your terminology." Vendors who write this honestly (Cilium, Grafana Beyla, Polar Signals) get points. Vendors whose docs only say "no agent required" without the per-node-DaemonSet clarification lose points.
Common confusions
- "Agentless means no software runs on my node." No. Modern "agentless observability" means a single shared agent (eBPF DaemonSet) replaces N per-service agents. Genuinely agentless monitoring (CloudWatch reading EC2 metadata, Cloudflare's edge logs, an HTTPS endpoint prober) does exist; eBPF-based observability is not it. The distinction matters because the threat model is completely different.
- "eBPF is always safer than SDK-based instrumentation." False in two directions. eBPF concentrates privilege into one DaemonSet running with
CAP_SYS_ADMIN; an SDK distributes it across N services running as non-root. A CVE in the SDK affects one service; a CVE in the eBPF agent affects every process on the node. The right framing is different threat models, not one is safer. - "Agentless means zero performance overhead." Wrong. eBPF agents do have measurable overhead — kprobe firing rate × per-fire cost (~500ns–2μs depending on stack depth). On a high-syscall workload (a busy Redis on a node) the overhead can be 1–4% of CPU. Pyroscope-eBPF and Parca both publish numbers; vendor claims of "<0.1% overhead" usually measure on idle workloads, not your IPL-final-traffic case.
- "All eBPF observability tools are the same shape." They are not. Cilium / Hubble target L4/L7 network observability (service-mesh-shaped data); Pixie targets in-cluster live forensics with a 24h ring buffer; Parca targets continuous CPU profiling; Datadog's eBPF tracer targets distributed-trace inference from socket events. Each of these requires different probe types (kprobe, uprobe, tracepoint, perf_event), different kernel features, and produces different data shapes. "We use eBPF" tells you almost nothing about what the tool actually does — see
/wiki/parca-pixie-pyroscopefor one slice of this. - "If the SDK rollout is hard, agentless is automatic." No — agentless rollouts have their own coordination cost, just borne by a different team. Instead of asking 30 squads to add a
pip install opentelemetry-sdkline, you are asking your platform team to deploy a privileged DaemonSet, get it through security review, integrate it with your CI/CD, configure SLOs against its uptime, and respond to its alerts. The work shifts; it does not vanish. - "
uprobeonSSL_writedecrypts everything." Not in 2026. Static-linked binaries (Go's default for production), stripped binaries (size optimisation), binaries with custom TLS stacks (Go'scrypto/tls, Rust'srustlsinstead of OpenSSL), and binaries inside scratch container images all defeat or partially defeatuprobe-based decryption. The vendor's demo on a glibc-linked Python service shows you their best case; ask them to demo on your actual production image.
Going deeper
Genuine agentless monitoring: the patterns that earned the word honestly
Before eBPF rebranded the term, agentless had clear technical meaning. Five patterns earn the name without rebranding:
Endpoint probing. A monitoring service in a different VPC issues HTTP/TCP/DNS probes against your service. Pingdom, Datadog Synthetics, Catchpoint, Atlas-Operator (the Indian RailTel / NIXI weather-station monitor that runs from outside customer infra) all do this. Zero code on the target. Genuinely agentless. Limited to external observability — you see what a client sees, not what the service sees.
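For contrast with the eBPF shape, this is what a genuinely agentless probe looks like end to end. A minimal sketch that times the TCP connect, TLS handshake, and time-to-first-byte from outside the target; the hostname is a placeholder:
# external_probe.py -- genuinely agentless monitoring: nothing runs on the target,
# the prober only sees what any client sees. The hostname is a placeholder.
import socket, ssl, time

HOST, PORT = "example.com", 443

t0 = time.perf_counter()
raw = socket.create_connection((HOST, PORT), timeout=5)
t_tcp = time.perf_counter()

ctx = ssl.create_default_context()
tls = ctx.wrap_socket(raw, server_hostname=HOST)   # TLS handshake happens here
t_tls = time.perf_counter()

tls.sendall(f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode())
first_byte = tls.recv(1)                            # time to first byte of the response
t_ttfb = time.perf_counter()
tls.close()

print(f"tcp connect  : {(t_tcp - t0) * 1000:6.1f} ms")
print(f"tls handshake: {(t_tls - t_tcp) * 1000:6.1f} ms")
print(f"ttfb         : {(t_ttfb - t_tls) * 1000:6.1f} ms")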
Cloud-API metadata polling. AWS CloudWatch reading EC2 hypervisor counters, Azure Monitor reading VM agent telemetry that Microsoft already runs, GCP Cloud Monitoring reading Compute Engine metadata. The cloud provider runs the agent (theirs, not yours); your monitoring tool reads the API. Your application sees nothing on its filesystem or in its process tree. Genuinely agentless from your point of view; not agentless at all from the cloud provider's.
SNMP polling. Network gear (Cisco switches, Juniper routers, F5 load balancers) exposes SNMP MIBs over UDP/161; the monitor polls. The device runs an SNMP daemon (which is its agent, not the monitoring tool's); the monitor never runs code on the device. This is the classical "agentless network monitoring" lineage: SolarWinds, ManageEngine OpManager, Zabbix.
Log shipping from a managed service. AWS RDS publishes slow-query logs to CloudWatch Logs; you read them via the AWS API. Cloudflare publishes Logpush JSON to your S3. The managed service is doing the shipping; your tool reads. Genuinely agentless for the consumer.
eBPF host-side from outside the workload's container. This is the gray zone. If the observability DaemonSet runs on a separate node from the workload (e.g. a "monitoring tier" of nodes that taints away workloads), then from the application node's point of view there is no agent. But the observability tool then can only see the wire, not the application internals — it has converted to true endpoint probing. Some vendors offer this as a "high-security mode"; coverage drops sharply.
The honest taxonomy: only the first four are genuinely agentless. Modern eBPF tools are low-friction agents, not zero agents. Calling them agentless is a positioning move.
What the kernel actually needs from the agent — the privilege math
The kernel-level capabilities required by an eBPF observability agent are not arbitrary; they map to specific eBPF features. Understanding this lets you push back on vendors who request more than they need. The minimum viable set for a tracing/profiling agent:
- CAP_BPF (kernel 5.8+) — load BPF programs of any type. Required for any eBPF tool.
- CAP_PERFMON (kernel 5.8+) — attach to perf events, kprobes, tracepoints. Required for sampling profilers and most tracers.
- CAP_NET_ADMIN — required only for socket / TC / XDP attach (network observability). A pure profiler does not need this.
- CAP_SYS_PTRACE — required for process_vm_readv (cross-process memory read), used by language-aware unwinders (Python, JVM). Optional unless the agent does symbol resolution from outside the target process.
- Mount of /sys/fs/bpf — for pinned BPF maps that survive agent restart.
- Mount of /sys/kernel/debug (tracefs) — for kprobe/tracepoint enumeration.
Vendors who request CAP_SYS_ADMIN are asking for the historical pre-5.8 superset because their code path supports both old and new kernels and they took the lazy path. A modern, security-conscious observability tool runs with CAP_BPF + CAP_PERFMON only on kernel 5.8+ fleets. If your fleet is kernel 5.8+, you should require this; it cuts the agent's privileges from "everything" to "just BPF and profiling".
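A quick way to anchor that pushback in your own fleet: check the kernel version per node and decide which capability set you are entitled to demand. A sketch using the thresholds above; it checks the kernel only, not whether the vendor's code actually supports the reduced set:
# capability_floor_check.py -- what capability set you should be able to demand from an
# eBPF observability vendor on this node, based on the kernel-version thresholds above.
import platform

release = platform.release()                      # e.g. "6.5.0-27-generic"
major, minor = (int(x) for x in release.split(".")[:2])

def at_least(maj: int, mnr: int) -> bool:
    return (major, minor) >= (maj, mnr)

print(f"kernel: {release}")
if at_least(5, 8):
    print("=> demand CAP_BPF + CAP_PERFMON (+ CAP_NET_ADMIN only for network observability)")
    print("   a request for CAP_SYS_ADMIN on this kernel deserves a written justification")
elif at_least(4, 18):
    print("=> pre-split kernel: CAP_SYS_ADMIN is genuinely required for eBPF loading")
else:
    print("=> kernel too old for CO-RE-based agents; most 2026 eBPF tools will not run here")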
Why genuine zero-instrumentation traces are mostly a fiction
The promise of eBPF-based "trace inference" — building a full distributed trace from socket events alone, no SDK — has been pitched since 2021. The reality in 2026 is that it works for simple synchronous HTTP/gRPC call graphs and breaks on everything else. The hard cases:
Asynchronous task continuation. Your Python service receives an HTTP request, enqueues a job onto Celery via Redis, returns 202. The eBPF tracer sees the inbound HTTP and the Redis LPUSH, but it cannot connect them to the worker that picks up the job 30 seconds later — the trace context lives in the Celery task body, not in any wire frame the tracer can parse without app-specific knowledge.
Connection pooling. The eBPF tracer sees a TCP connection from process A to process B; if A holds a pool of 50 connections to B and multiplexes 5,000 logical requests across them, the tracer cannot tell which logical request is which without parsing the HTTP/2 stream-id, which requires it to track HPACK state per-connection. Some vendors do this; many do not. Ask explicitly.
Background work. A cron job inside the service that runs every 60s does not have a parent span unless the SDK injects one. The eBPF tracer sees the syscalls but cannot construct a trace ID for work that has no incoming request.
Cross-process trace context propagation in non-standard headers. The W3C traceparent header is well-known, but many internal Indian fintech services use custom headers (X-Razorpay-Request-Id, X-Trace-Token) for legacy reasons. The eBPF tracer that only knows W3C will not stitch these — see /wiki/b3-w3c-trace-context for the propagation lineage.
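To see what the SDK contributes in the asynchronous case, here is a sketch of producer-side context injection and worker-side extraction using the OpenTelemetry propagation API; the queue is a plain in-memory list standing in for Celery/Redis, which is exactly the hop a wire-level tracer cannot stitch:
# async_context_propagation.py -- the producer serialises trace context into the job
# payload itself, and the worker restores it. Sketch only; queue plumbing is stubbed.
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("jobs")
queue = []  # stand-in for Redis/Celery

def enqueue_job(payload: dict) -> None:
    with tracer.start_as_current_span("enqueue-report"):
        carrier = {}
        inject(carrier)                      # writes the traceparent header into the dict
        queue.append({"payload": payload, "otel": carrier})

def worker_loop() -> None:
    job = queue.pop(0)
    parent_ctx = extract(job["otel"])        # restores the producer's trace context
    with tracer.start_as_current_span("process-report", context=parent_ctx) as span:
        span.set_attribute("report.rows", len(job["payload"]))

enqueue_job({"row1": 1, "row2": 2})
worker_loop()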
For these cases, the SDK is doing real work that the eBPF agent cannot replicate. Vendors who acknowledge this in their docs (Cilium says "for full distributed tracing, combine Hubble with OTel SDK") earn trust. Vendors who pitch eBPF as a complete OTel replacement are selling a half-truth.
The "vendor parses your wire protocol" problem
A subtle vendor-trap: every "agentless" tool that promises L7 visibility (HTTP, gRPC, MySQL, Redis, Kafka, Postgres) implements a wire-protocol parser inside its eBPF program. These parsers are by necessity simplifications — gRPC supports HTTP/2 with HPACK header compression and bidirectional streaming; the eBPF parser cracks the HTTP/2 frame header and skips streaming for performance. MySQL has a binary protocol with prepared-statement caching; the eBPF parser handles COM_QUERY but often skips COM_STMT_EXECUTE. The result: the vendor's dashboard shows you "MySQL queries: 4,200/sec" and the number is the queries the parser understood, not the total queries on the wire. Prepared-statement-heavy workloads (most ORMs) can show 60-80% under-counting in this mode. Always cross-check the vendor's L7 numbers against a known-good SDK-emitted counter, in the first month, on a workload you control. The mismatch tells you what the vendor's parser handles vs misses.
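The cross-check itself is trivial once you have the two numbers. A sketch; fetching the vendor's reported rate and your own SDK or application counter for the same window is left to you, and the example figures below are invented to show the shape of the output:
# l7_undercount_check.py -- the first-month cross-check the paragraph above recommends.
def undercount_report(vendor_qps: float, app_qps: float, protocol: str) -> None:
    if app_qps <= 0:
        raise ValueError("need a non-zero app-side counter to compare against")
    coverage = vendor_qps / app_qps
    missed = max(0.0, 1.0 - coverage)
    print(f"{protocol}: vendor sees {vendor_qps:,.0f}/s of {app_qps:,.0f}/s "
          f"({coverage:.0%} parsed, ~{missed:.0%} invisible to the wire parser)")

# Invented example numbers, in the shape the paragraph describes:
undercount_report(vendor_qps=4_200, app_qps=11_000, protocol="mysql (prepared-statement heavy)")
undercount_report(vendor_qps=9_800, app_qps=10_100, protocol="http/1.1")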
Reproduce this on your laptop
# Audit a vendor's "agentless" Helm chart for actual privilege footprint
helm repo add <vendor> https://<vendor-helm-repo>
helm template <vendor>/observability-agent > rendered.yaml
python3 -m venv .venv && source .venv/bin/activate
pip install pyyaml
python3 inspect_agentless_daemonset.py rendered.yaml
# Expect to see: hostPID, hostNetwork, CAP_SYS_ADMIN or CAP_BPF, /sys/kernel/debug mount
Where this leads next
The next chapter — /wiki/ebpf-for-network-observability-cilium-hubble — takes the agentless eBPF shape and applies it to the network layer specifically: Cilium and Hubble are the most operationally honest eBPF observability tools in 2026, and their docs describe the trade-offs in this chapter cleanly. Reading them after this chapter, you will recognise their design choices as conscious responses to the privilege-and-context losses described here.
After that, /wiki/ebpf-limitations-in-production is the more skeptical companion to this chapter — what eBPF cannot do, what it does poorly at scale, and where the kernel-version coupling bites you in 2026 production. The pair (this chapter on marketing claims, that chapter on engineering limits) is the honest framing your procurement team should read together.
Looking back, the arc from /wiki/why-ebpf-changed-the-game through /wiki/bpftrace-for-ad-hoc-tracing and /wiki/parca-pixie-pyroscope has built the case for what eBPF observability gives you; this chapter is the first honest accounting of what it costs you. Both halves are required reading before a procurement call.
References
- Brendan Gregg, "BPF Performance Tools" (Addison-Wesley, 2019), Chapter 3 — the canonical reference on what eBPF can and cannot observe from the kernel side, including the privilege model.
- Cilium Project, "Cilium Network Policy and Threat Model" (docs.cilium.io/en/stable/security/threat-model) — one of the few vendor docs that publishes its threat model honestly, including blast-radius analysis.
- Liz Rice, "Learning eBPF" (O'Reilly, 2023), Chapter 10 (Security and the BPF Verifier) — the case for CAP_BPF / CAP_PERFMON over CAP_SYS_ADMIN and why kernel 5.8+ matters.
- Polar Signals blog, "What does 'agentless' mean for continuous profiling?" (polarsignals.com/blog/posts/2023/agentless-meaning) — vendor honesty, by the team that built Parca, on why their DaemonSet is still an agent.
- KubeCon EU 2024, "What we learned running eBPF agents in production: a Razorpay case study" — the real-world rollout post that informed the Razorpay merchant-tier story in this chapter.
- CVE-2022-23222 — Linux kernel BPF verifier privilege-escalation vulnerability — the canonical example of why a privileged DaemonSet is a real risk surface, not a theoretical one.
- Grafana Beyla docs, "When to use Beyla vs OpenTelemetry SDKs" (grafana.com/docs/beyla/latest/when-to-use) — the cleanest official "use both, here is how to decide" doc among 2026 vendors.
- /wiki/parca-pixie-pyroscope — the previous chapter; the three eBPF continuous-profilers whose marketing this chapter critiques.
- /wiki/why-ebpf-changed-the-game — the foundational chapter on what eBPF actually is, before the marketing layered on top.