Kubernetes services and endpoints
It is 19:40 IST on the night of a CricStream live final. The score-overlay deployment is in the middle of a rolling update — eight new pods coming up, eight old pods terminating, all behind a single Service called score-overlay with cluster-IP 10.96.42.17. The 12,000 client pods opening connections to that IP do not know — and do not need to know — that the membership behind it is changing every two seconds. The cluster-IP did not move. The pod IPs that back it did. Somewhere on every node in the cluster, a small Go binary called kube-proxy is rewriting iptables rules every few hundred milliseconds, watching the EndpointSlice objects in the API server change, and translating those changes into kernel packet-rewriting rules. The reason 12,000 clients still get their score updates is that this rewriting happens fast enough — and atomically enough per-rule — that the gap between an old pod terminating and its IP being removed from the iptables NAT chain is, on average, under 200 milliseconds.
A Kubernetes Service is not an address — it is a controller-managed indirection. The Service object has a stable virtual IP (the ClusterIP); the Endpoints / EndpointSlice object has the actual pod IPs. A controller (the endpoints controller) watches Pod readiness and rewrites EndpointSlice objects; kube-proxy on every node watches EndpointSlice and rewrites iptables / IPVS / nftables rules. Discovery is two watch loops in series, with the API server's etcd as the single source of truth.
What a Service object actually is
A Service in Kubernetes is a small declarative object stored in etcd. It does not run anywhere — there is no "Service process". It is a specification that something else (the endpoints controller, kube-proxy, CoreDNS) reads and acts on. The user writes:
apiVersion: v1
kind: Service
metadata:
  name: score-overlay
  namespace: cricstream
spec:
  selector:
    app: score-overlay
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
That YAML, once kubectl apply'd, becomes a row in etcd at /registry/services/specs/cricstream/score-overlay. It does two independent things:
- The Service controller in kube-controller-manager allocates a virtual IP from the configured Service CIDR (e.g. 10.96.0.0/12) and writes it into the Service object's spec.clusterIP. This IP belongs to nothing — no NIC has it, no router announces it. It exists only as a number that other components agree to translate.
- The endpoints controller (also in kube-controller-manager) watches Pod objects whose labels match the Service's selector (app: score-overlay), filters for those whose status.conditions[type=Ready] == True, and writes their pod-IP-and-port pairs into a separate object called EndpointSlice (or, on older clusters, Endpoints). Every time a pod becomes Ready, every time a pod is deleted, every time a readiness probe flips, the EndpointSlice is rewritten.
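The selection step the endpoints controller performs can be sketched in a few lines. This is a simplified illustration, not the controller's actual code, and the pod records are hypothetical stand-ins for real Pod objects:

```python
# endpoints_controller_sketch.py — toy version of the endpoints controller's
# core step: match pods by label selector, keep only the Ready ones.
def select_endpoints(pods, selector):
    """Return (ip, port) pairs the controller would write into EndpointSlice."""
    out = []
    for pod in pods:
        if all(pod["labels"].get(k) == v for k, v in selector.items()):
            conds = {c["type"]: c["status"] for c in pod["status"]["conditions"]}
            if conds.get("Ready") == "True":
                out.append((pod["status"]["podIP"], pod["port"]))
    return out

pods = [
    {"labels": {"app": "score-overlay"}, "port": 8080,
     "status": {"podIP": "10.244.7.8",
                "conditions": [{"type": "Ready", "status": "True"}]}},
    {"labels": {"app": "score-overlay"}, "port": 8080,   # readiness probe failing
     "status": {"podIP": "10.244.9.4",
                "conditions": [{"type": "Ready", "status": "False"}]}},
    {"labels": {"app": "billing"}, "port": 9000,         # different Service
     "status": {"podIP": "10.244.1.2",
                "conditions": [{"type": "Ready", "status": "True"}]}},
]
print(select_endpoints(pods, {"app": "score-overlay"}))
# only the Ready score-overlay pod survives the filter
```

Label match and readiness are independent filters: the NotReady pod still belongs to the Service, it is just excluded from the membership until its probe passes.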
The Service object is the stable name. The EndpointSlice is the changing membership. Together they implement the discovery primitive — but neither one routes a single packet on its own.
Why the control plane and the data plane are separated this way: if every packet had to consult a userspace daemon (the original kube-proxy design, pre-1.2, did exactly this and was abandoned), the daemon becomes the bottleneck — its socket-accept loop and CPU scheduler determine the cluster's latency floor. By moving the routing decision into the kernel as iptables / IPVS / nftables rules, the per-packet cost drops to a few hundred nanoseconds and kube-proxy only needs to run when the membership changes, which is rare relative to packet traffic. This is the same separation Linux uses everywhere: control-plane in userspace, data-plane in the kernel.
EndpointSlices, kube-proxy, and the rewrite loop
Until Kubernetes 1.17, the membership object was a single Endpoints resource per Service. For a Service backing 10,000 pods this was a single etcd row that had to be rewritten every time any one of those pods went Ready or NotReady — and every kube-proxy on every node had to receive the entire 10,000-entry blob over the watch stream. A single pod restart could push gigabytes of API-server bandwidth across a large cluster. CricStream's 2019-era platform team documented exactly this failure mode in an internal post-mortem: a single deploy of a 4,000-replica Service took the API server's network interface to 100% saturation for forty minutes, and dependent Services across the cluster experienced 8–30 second discovery freshness gaps for the duration.
EndpointSlice (GA in 1.21) shards the membership across multiple objects, each capped at 100 endpoints by default. A 10,000-pod Service is now ~100 EndpointSlices; a single pod going Ready rewrites one of them; kube-proxy receives the diff for one slice, not the full 10,000. The control-plane bandwidth cost dropped by roughly two orders of magnitude.
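The bandwidth claim is easy to check with back-of-envelope arithmetic. The per-endpoint byte size and node count below are illustrative assumptions, not measured figures:

```python
# Watch-stream cost of one pod flipping Ready, legacy Endpoints vs EndpointSlice.
BYTES_PER_ENDPOINT = 500       # rough serialized size per endpoint (assumed)
NODES = 1000                   # every node's kube-proxy holds a watch (assumed)
PODS = 10_000
SLICE_CAP = 100                # default EndpointSlice capacity

endpoints_event = PODS * BYTES_PER_ENDPOINT    # whole 10,000-entry blob resent
slice_event = SLICE_CAP * BYTES_PER_ENDPOINT   # only the affected slice resent

fanout_old = endpoints_event * NODES           # cluster-wide bytes per pod flip
fanout_new = slice_event * NODES
print(f"legacy Endpoints: {fanout_old/1e9:.1f} GB per pod flip")
print(f"EndpointSlice:    {fanout_new/1e6:.0f} MB per pod flip "
      f"({fanout_old // fanout_new}x less)")
```

With these assumptions the reduction is exactly the slice-capacity ratio, 100x — the "two orders of magnitude" above.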
The EndpointSlice object looks like this:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: score-overlay-x7g2k
  labels:
    kubernetes.io/service-name: score-overlay
addressType: IPv4
ports:
- name: http
  port: 8080
  protocol: TCP
endpoints:
- addresses: ["10.244.7.8"]
  conditions: { ready: true, serving: true, terminating: false }
  nodeName: node-3
  zone: ap-south-1a
- addresses: ["10.244.9.4"]
  conditions: { ready: true, serving: true, terminating: false }
  nodeName: node-7
  zone: ap-south-1b
# ... up to 100 endpoints per slice, sharded across multiple slice objects
The three boolean conditions are the key. ready means "include in normal load-balancing"; serving means "still capable of handling requests" (true even during graceful shutdown); terminating means "deletionTimestamp is set, this pod is going away". kube-proxy uses ready for round-robin selection and serving to honour in-flight requests through graceful shutdown — a critical fix (the ProxyTerminatingEndpoints work, which began landing in 1.22) to stop rolling deploys from connection-resetting clients mid-request.
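A minimal sketch of that selection rule, assuming the graceful-shutdown behaviour just described: ready endpoints are preferred, and only when none remain does traffic fall back to endpoints that are still serving while terminating.

```python
# Endpoint eligibility from the three condition bits — a sketch, not
# kube-proxy's actual code. Assumes the terminating-fallback behaviour:
# prefer ready; if no endpoint is ready, use serving+terminating ones.
def eligible(endpoints):
    ready = [e for e in endpoints if e["ready"]]
    if ready:
        return ready
    return [e for e in endpoints if e["serving"] and e["terminating"]]

mid_rollout = [
    {"ip": "10.244.7.8", "ready": False, "serving": True,  "terminating": True},
    {"ip": "10.244.9.4", "ready": True,  "serving": True,  "terminating": False},
]
all_draining = [
    {"ip": "10.244.7.8", "ready": False, "serving": True,  "terminating": True},
]
print([e["ip"] for e in eligible(mid_rollout)])   # the ready pod wins
print([e["ip"] for e in eligible(all_draining)])  # fallback: drain target kept
```

The second case is the one that matters during a rollout gone wrong: with zero ready backends, sending traffic to a draining-but-serving pod beats returning connection refused.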
# kube_proxy_simulation.py — minimal sketch of what kube-proxy does on each EndpointSlice change
# Simulates the watch → diff → iptables-rewrite loop for one Service
import queue
import threading
import time

class FakeApiServerWatch:
    """Yields (op, slice) tuples — what a real client-go informer would deliver."""
    def __init__(self):
        self.q = queue.Queue()

    def stream(self):
        while True:
            yield self.q.get()

class FakeKubeProxy:
    def __init__(self, watch):
        self.watch = watch
        self.endpoints = {}       # service_name -> set(ip:port)
        self.iptables_rules = []  # what we'd push to the kernel

    def reconcile(self, svc, eps):
        old = self.endpoints.get(svc, set())
        new = {f"{e['ip']}:{e['port']}" for e in eps if e["ready"]}
        added, removed = new - old, old - new
        if added or removed:
            t0 = time.time()
            self._rewrite_chain(svc, new)
            print(f"[{time.strftime('%H:%M:%S')}] svc={svc:14s} +{len(added)} -{len(removed)} "
                  f"=> rules={len(self.iptables_rules):3d} apply={1000*(time.time()-t0):.1f}ms")
        self.endpoints[svc] = new

    def _rewrite_chain(self, svc, eps):
        # In real iptables mode: -t nat -N KUBE-SVC-...; one DNAT rule per endpoint.
        # Simplification: a flat 1/N probability on every rule; real kube-proxy
        # writes conditional probabilities (1/N, 1/(N-1), ..., 1).
        self.iptables_rules = [f"-A KUBE-SVC-{svc} -m statistic --probability {1/len(eps):.4f} "
                               f"-j DNAT --to-destination {ep}" for ep in sorted(eps)] if eps else []

    def run(self):
        for op, slc in self.watch.stream():
            self.reconcile(slc["service"], slc["endpoints"])

# Drive a fake control plane: a deploy that adds 4 pods, then kills 2, then adds 1
watch = FakeApiServerWatch()
kp = FakeKubeProxy(watch)
threading.Thread(target=kp.run, daemon=True).start()
ips = [f"10.244.7.{i}" for i in range(1, 6)]
watch.q.put(("UPDATE", {"service": "score-overlay", "endpoints":
                        [{"ip": ips[0], "port": 8080, "ready": True}]}))
time.sleep(0.05)
for i in range(1, 4):
    watch.q.put(("UPDATE", {"service": "score-overlay", "endpoints":
                            [{"ip": ip, "port": 8080, "ready": True} for ip in ips[:i+1]]}))
    time.sleep(0.05)
# kill two
watch.q.put(("UPDATE", {"service": "score-overlay", "endpoints":
                        [{"ip": ip, "port": 8080, "ready": True} for ip in ips[2:4]]}))
time.sleep(0.05)
# add one back
watch.q.put(("UPDATE", {"service": "score-overlay", "endpoints":
                        [{"ip": ip, "port": 8080, "ready": True} for ip in ips[2:5]]}))
time.sleep(0.5)
Sample run:
[19:40:01] svc=score-overlay +1 -0 => rules= 1 apply=0.2ms
[19:40:01] svc=score-overlay +1 -0 => rules= 2 apply=0.3ms
[19:40:01] svc=score-overlay +1 -0 => rules= 3 apply=0.3ms
[19:40:01] svc=score-overlay +1 -0 => rules= 4 apply=0.4ms
[19:40:01] svc=score-overlay +0 -2 => rules= 2 apply=0.3ms
[19:40:01] svc=score-overlay +1 -0 => rules= 3 apply=0.4ms
Per-line walkthrough. The line new = {f"{e['ip']}:{e['port']}" for e in eps if e["ready"]} is where kube-proxy filters by readiness — endpoints with ready=False drop out of the load-balancing pool (in the real data plane the serving flag keeps draining endpoints usable for in-flight traffic; the simulation skips this). The line added, removed = new - old, old - new computes the diff against the previous reconciliation — real kube-proxy does the same so that a no-op event costs nothing, and since roughly 1.27 its iptables backend restores only the chains that changed rather than the full table (a 10,000-rule full rewrite can stall for tens of milliseconds). The line -m statistic --probability {1/len(eps):.4f} gestures at the iptables statistic-module idiom — random selection evaluated in the kernel for every packet — though the simulation's flat 1/N per rule is a simplification; the real per-rule probabilities are conditional, as the next paragraph explains. self.endpoints[svc] = new caches the last-seen state per Service so the next event computes a diff, not a full rewrite. In production, kube-proxy uses informer caches (from the client-go library) to do this efficiently across thousands of Services.
Why iptables uses --probability rather than round-robin: iptables rules are evaluated sequentially per packet — there is no shared counter across rules. To get equal load-balancing, each rule's --probability must equal the conditional probability that this rule fires given that no prior rule did. For three endpoints, the probabilities are 1/3, 1/2, 1 — kube-proxy computes these and writes them in. A round-robin counter would require the kernel to maintain shared state per Service (which is exactly what IPVS and nftables do, with their own connection trackers — and exactly why IPVS scales better than iptables past ~5,000 Services).
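The conditional-probability arithmetic can be checked empirically. This toy simulator evaluates the 1/3, 1/2, 1 rule sequence per "packet", exactly as a three-rule chain would, and counts where the packets land:

```python
import random

# Sequential rule evaluation with conditional probabilities 1/3, 1/2, 1:
# each of three endpoints should receive ~1/3 of the packets.
def pick(endpoints, rng):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if rng.random() < 1 / (n - i):   # conditional probability for rule i
            return ep
    return endpoints[-1]                 # unreachable: the last rule fires with p=1

rng = random.Random(42)                  # seeded for reproducibility
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(30_000):
    counts[pick(["a", "b", "c"], rng)] += 1
print(counts)                            # each count close to 10,000
```

Setting every rule to a flat 1/3 instead would skew traffic: the first rule would get 1/3, the second (1 - 1/3) x 1/3 ≈ 22%, and the last rule (reached with probability 4/9) would absorb the remainder.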
Service types and the four flavours of "external"
The Service object carries a type field that selects between four implementations of the same indirection. They differ in what makes the cluster-IP (or the equivalent) externally reachable.
ClusterIP is the default. The IP comes from the cluster's Service CIDR (10.96.0.0/12 by default in kubeadm). It is reachable only from inside the cluster — pods on any node can connect to it, but a packet from outside the cluster has nothing that knows what to do with that destination. Most internal microservices use this.
NodePort allocates a port (default range 30000–32767) on every node in the cluster and tells kube-proxy to forward node-IP:nodePort → cluster-IP → backend pod. This makes the Service reachable from outside the cluster as long as you can hit any node's IP on the right port. It is the simplest external exposure but pushes the load-balancing problem outward — the caller must pick which node IP to hit, usually via an external load balancer or DNS round-robin.
LoadBalancer asks the cluster's cloud provider integration (the cloud-controller-manager) to provision an actual external load balancer (an AWS NLB, a GCP TCP/UDP LB, an Azure Load Balancer) that fronts the NodePort and exposes a public IP. The Service object is created immediately; the cloud LB is provisioned asynchronously, out-of-band; the public IP is then written back into the Service's status.loadBalancer.ingress field. This is how a stock type: LoadBalancer Service becomes a 40.x.y.z IP your DNS can A-record.
ExternalName is a DNS shim — the Service has no IP and no endpoints; it is just a CNAME that CoreDNS will return when something looks up myservice.default.svc.cluster.local. Useful for migrating from external services into the cluster (today's db.svc.cluster.local resolves to legacy-db.paysetu.local; tomorrow you switch the ExternalName to a ClusterIP backed by an in-cluster Postgres).
The fifth, less-discussed mode is headless: spec.clusterIP: None. There is no virtual IP, no kube-proxy rule. Instead, CoreDNS returns all the pod IPs as A records when you query score-overlay.cricstream.svc.cluster.local. This is what StatefulSets use — Cassandra, Kafka, Postgres replicas — when the client wants to know each pod individually rather than load-balance across them.
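The ClusterIP/headless split determines what DNS returns. A sketch of the answer CoreDNS's kubernetes plugin would synthesize — the service records here are hypothetical stand-ins for its API-server-fed cache:

```python
# DNS A-record synthesis for a Service: one virtual IP for a normal Service,
# all ready pod IPs for a headless one. A sketch, not CoreDNS's actual logic.
def a_records(service, endpoints):
    if service.get("clusterIP") in (None, "None"):     # headless Service
        return sorted(e["ip"] for e in endpoints if e["ready"])
    return [service["clusterIP"]]                       # normal ClusterIP

svc_normal   = {"name": "score-overlay", "clusterIP": "10.96.42.17"}
svc_headless = {"name": "cassandra", "clusterIP": "None"}
eps = [{"ip": "10.244.7.8", "ready": True}, {"ip": "10.244.9.4", "ready": True}]
print(a_records(svc_normal, eps))    # the stable virtual IP, regardless of pods
print(a_records(svc_headless, eps))  # every ready pod IP, for per-pod clients
```

The client-side consequence: a normal Service hides membership churn behind one IP, while a headless Service pushes every membership change into the DNS answer — which is exactly what a Cassandra or Kafka client wants.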
Failure modes — what breaks and how
Stale endpoints during rolling deploy. A pod is killed; the kubelet patches its readiness to False; the API server records the change; the endpoints controller's informer cache delivers the event; the controller rewrites the EndpointSlice; every kube-proxy's informer cache delivers that event; every kube-proxy rewrites its iptables chain. That is six hops. The end-to-end p99 latency is roughly 200–500 ms on a healthy cluster, and 5–30 seconds on a cluster where the API server is overloaded or kube-proxy is CPU-throttled. During that window, packets DNAT to a pod that is no longer accepting connections — the client sees ECONNREFUSED or, worse, a TCP RST mid-request. The fix is graceful pod termination with terminationGracePeriodSeconds ≥ propagation latency + drain time, plus a preStop hook that sleeps long enough to let the EndpointSlice update propagate before the process exits.
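That recipe can be written down as a pod-spec fragment. The numbers are illustrative, not tuned recommendations, and the preStop hook assumes the container image ships a sleep binary:

```yaml
# Illustrative fragment: let endpoint removal propagate (preStop sleep)
# before SIGTERM reaches the process, and budget the whole drain inside
# the grace period.
spec:
  terminationGracePeriodSeconds: 45   # >= propagation latency + drain time
  containers:
  - name: score-overlay
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]    # EndpointSlice + iptables updates land here
```

The ordering is the point: the preStop hook runs before the kubelet sends SIGTERM, so the sleep buys time for every kube-proxy to drop the pod's IP while the process is still accepting connections.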
Service-CIDR exhaustion. The Service CIDR is sized at cluster creation and is hard to change. A /12 (10.96.0.0/12) gives ~1M addresses; a /24 gives 254. PaySetu hit this on a multi-tenant platform cluster where every tenant created their own namespaces and ~30 Services per tenant — at 8,000 tenants the cluster ran out of cluster-IPs, new Service creates failed with Internal Server Error: failed to allocate a serviceIP: range is full. The fix required a control-plane outage and a CIDR resize via kube-apiserver --service-cluster-ip-range.
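The capacity arithmetic behind that incident, using the standard library. The minus-2 mirrors the "254 usable" convention above; Kubernetes' own allocator reserves addresses slightly differently:

```python
import ipaddress

# How many cluster-IPs each Service CIDR actually yields.
for cidr in ("10.96.0.0/12", "10.96.0.0/24"):
    net = ipaddress.ip_network(cidr)
    print(f"{cidr}: {net.num_addresses - 2:,} allocatable cluster-IPs")

# The multi-tenant platform above: 8,000 tenants x ~30 Services each.
print(f"needed: {8_000 * 30:,} Services")   # fits comfortably in a /12
```

240,000 Services fit a /12 with headroom, but the failure mode is silent until the day a create fails — nothing warns as the range fills.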
iptables-rule explosion. With 5,000 Services and 50 endpoints each, you have 250,000 iptables rules per node. Every packet traverses the chain linearly — even with --probability rules, the kernel evaluates each rule until one matches. P99 packet-routing latency degrades from microseconds to milliseconds. This is the reason IPVS mode (--proxy-mode=ipvs) and nftables mode (--proxy-mode=nftables, beta in 1.31, GA in 1.33) exist — both use hash-based lookup that is O(1) in the number of rules rather than O(N).
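The O(N)-vs-O(1) gap is easy to feel in a toy benchmark. Absolute times are machine-dependent and this is Python, not the kernel, but the asymptotic shape is the point:

```python
import random
import time

# 250,000 "rules", mirroring the 5,000-Service x 50-endpoint example above.
rules = [(f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}", f"backend-{i}")
         for i in range(250_000)]
table = dict(rules)                      # the IPVS/nftables-style hash table
rng = random.Random(7)
targets = [rules[rng.randrange(len(rules))][0] for _ in range(50)]

t0 = time.perf_counter()
for dst in targets:                      # O(N): walk the chain until a match
    next(r for r in rules if r[0] == dst)
linear = time.perf_counter() - t0

t0 = time.perf_counter()
for dst in targets:                      # O(1): one hash probe per packet
    table[dst]
hashed = time.perf_counter() - t0
print(f"linear scan: {linear*1e3:.1f} ms   hash lookup: {hashed*1e3:.3f} ms")
```

The linear scan's cost grows with every Service added to the cluster; the hash lookup's does not — which is the entire argument for IPVS and nftables at scale.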
API-server overload from EndpointSlice churn. A flapping pod (failing readiness probe every 5 seconds) generates a constant stream of EndpointSlice updates. At 1 update/second across 200 flapping pods, the API server is rewriting EndpointSlices at 200 writes/second — all of which fan out as watch events to every kube-proxy on every node. CricStream traced exactly this: 200 ms p99 connect-latency degraded to 4 seconds when their readiness probe was tuned too aggressively (1 s timeout for an HTTP path that occasionally took 1.2 s under GC).
Why fixing the readiness probe is sometimes safer than fixing the API server: API-server scaling means more replicas, larger etcd, more memory — all of which need careful capacity planning and risk a multi-hour cluster reboot. Tuning the probe (raise the timeout from 1 s to 3 s, raise failureThreshold from 1 to 3) takes a kubectl edit and rolls out in seconds. The right instinct on cluster overload from EndpointSlice churn is "what is generating the churn" before "how do I scale the receiver". This is a general distributed-systems principle: backpressure on the source is cheaper than capacity on the sink.
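The fan-out arithmetic makes the source-vs-sink argument concrete. The node count and event size below are assumed illustrative figures, not CricStream's actual numbers:

```python
# Watch fan-out generated by flapping readiness probes.
FLAPPING_PODS = 200
FLIPS_PER_POD_PER_SEC = 1
NODES = 500                    # one watching kube-proxy per node (assumed)
EVENT_BYTES = 50_000           # serialized EndpointSlice watch event (assumed)

writes_per_sec = FLAPPING_PODS * FLIPS_PER_POD_PER_SEC
egress_bytes = writes_per_sec * NODES * EVENT_BYTES
print(f"{writes_per_sec} EndpointSlice writes/s -> "
      f"{egress_bytes / 1e9:.1f} GB/s of watch fan-out from the API server")
# Tuning the probe (timeout 1s -> 3s, failureThreshold 1 -> 3) attacks
# writes_per_sec directly; adding API-server replicas only raises the ceiling.
```

Every term in that product except NODES is under the probe's control, which is why the probe edit is the cheap fix.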
Common confusions
- "The ClusterIP belongs to a load balancer." It belongs to nothing. There is no NIC with that IP, no router that announces it. It exists only as a number that iptables / IPVS / nftables on every node knows how to rewrite. If you tcpdump for the cluster-IP on the wire between two nodes, you will not find it — it is rewritten on the source node before the packet leaves.
- "kube-proxy proxies the traffic." It does not, except in the ancient userspace mode (the default before 1.2, long deprecated and since removed). In the iptables/IPVS/nftables modes it is purely a controller — it watches the API server and writes kernel rules. Once the rules are in place, the kernel does all the packet rewriting; kube-proxy can crash and traffic keeps flowing (with a stale rule set) until a Service membership change happens.
- "Endpoints and EndpointSlice are the same thing." They are two API objects representing the same data. Endpoints (v1, legacy) is one object per Service with all backends in a single list, truncated at 1,000 entries. EndpointSlice (discovery.k8s.io/v1) is sharded — many objects per Service, ≤100 backends each. The control plane still maintains both for backwards compatibility (and a mirroring controller converts manually-managed Endpoints into EndpointSlices), but new clients should watch EndpointSlice directly.
- "A Service is up as long as its ClusterIP responds to ping." ClusterIPs do not respond to ping — they have no network stack of their own. The way to test a Service is to curl it from a pod; even then, a 200 OK only proves one endpoint is healthy, not all of them. Real liveness checks require querying the API server's EndpointSlices and counting Ready endpoints.
- "type: LoadBalancer is the only way to expose a Service externally." Ingress controllers (nginx, Traefik, Istio gateway) typically run as a single LoadBalancer Service that fronts many internal Services via HTTP routing. For HTTP traffic this is dramatically cheaper — one cloud-LB invoice instead of one per Service. Each cloud-LB in AWS costs roughly ₹1,800–2,400/month before traffic, so a fleet of 200 internet-facing Services costs ₹4–5 lakh/month if you naively give each its own LoadBalancer Service.
- "Headless Services don't need kube-proxy." Correct — but they do need CoreDNS, and CoreDNS itself watches the API server for EndpointSlice changes. The watch chain is shorter (no iptables) but the staleness window is the same.
Going deeper
How CoreDNS turns a Service name into an answer
Inside the cluster, every pod has /etc/resolv.conf pointing at the CoreDNS Service IP (typically 10.96.0.10). When a pod looks up score-overlay.cricstream.svc.cluster.local, CoreDNS — which is itself a deployment running pods, fronted by a Service, watching the API server — runs the query through its plugin chain. The kubernetes plugin matches the .svc.cluster.local zone, looks up the Service by namespace and name in its in-memory cache (populated by API-server watches), and returns the cluster-IP as the A record. For a headless Service, the same plugin returns the pod IPs from the EndpointSlice. The kubernetes plugin's cache is updated by the same EndpointSlice watch loop that kube-proxy uses; freshness is comparable. For SRV records (used by StatefulSets and gRPC name resolvers), CoreDNS additionally returns the per-port targets — _grpc._tcp.score-overlay... returns 0 0 8080 score-overlay-0.score-overlay.cricstream.svc.cluster.local.
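Before CoreDNS sees anything, the pod's resolver expands short names using the search domains the kubelet writes into /etc/resolv.conf. A sketch of that expansion, assuming the standard search list for a pod in the cricstream namespace:

```python
# resolv.conf search-domain expansion, as a glibc-style resolver does it.
SEARCH = ["cricstream.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5   # kubelet default: names with fewer dots try the search list first

def candidate_queries(name):
    """Names tried, in order, for a lookup of `name` from a cricstream pod."""
    if name.endswith("."):                   # fully qualified: ask as-is
        return [name]
    tries = []
    if name.count(".") < NDOTS:
        tries = [f"{name}.{domain}." for domain in SEARCH]
    return tries + [f"{name}."]

for q in candidate_queries("score-overlay"):
    print(q)
# the first candidate, score-overlay.cricstream.svc.cluster.local., resolves
```

This is also why looking up external names from pods can cost several extra round-trips: with ndots:5, even example.com is tried against the search list before the bare name.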
IPVS vs iptables vs nftables — when each wins
iptables-mode kube-proxy has been the default since 1.2. It works because every cluster ships with iptables already; rules are simple linear chains. The pain starts at scale: at ~5,000 Services per cluster, the rule-list traversal becomes a measurable per-packet cost (low milliseconds). IPVS mode (alpha in 1.8, GA in 1.11) replaces the linear chain with a hash table and a connection tracker; the load-balancing decision becomes O(1) regardless of cluster size, and IPVS supports richer algorithms (least-connection, source-hash, never-queue). The catch: IPVS still uses iptables for some auxiliary rules (NodePort masquerade, source-IP preservation), and the dual-mode operation has been a source of subtle bugs. nftables mode (alpha in 1.29, beta in 1.31, GA in 1.33) is the long-term replacement — constant-time lookups like IPVS, but using the modern in-kernel nftables packet classifier, with cleaner rule-set semantics and faster rule reloads (nftables applies changes transactionally and incrementally; iptables-restore must rewrite whole tables).
Topology-aware routing and internalTrafficPolicy
When a Service is fronted by 100 pods spread across 3 zones, sending a request from a pod in zone ap-south-1a to a backend in ap-south-1c costs an extra 1–2 ms RTT and an inter-AZ data-transfer charge (₹0.012/GB on most cloud providers, ₹3.6 lakh/month for a 250 Mbps cross-AZ stream). Topology-aware routing (originally the service.kubernetes.io/topology-aware-hints: Auto annotation, superseded by the trafficDistribution: PreferClose field, beta in 1.31) tells kube-proxy to prefer backends in the same zone as the calling pod, falling back to other zones only when the local zone is empty. The implementation is per-EndpointSlice "hints" that the EndpointSlice controller computes — each endpoint declares which zones it should serve, and each kube-proxy uses only the endpoints whose hints include its own zone. The win at CricStream scale (8 zones, 4,000 pods) was a 30% reduction in inter-AZ traffic and a 0.8 ms drop in p50 service-call latency.
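Hint consumption can be sketched in a few lines. This is an illustration of the mechanism, not kube-proxy's actual code, and the endpoint records are hypothetical:

```python
# A kube-proxy in zone `my_zone` keeps only endpoints whose forZones hints
# include that zone, falling back to every endpoint when no hint covers it.
def usable_endpoints(endpoints, my_zone):
    hinted = [e for e in endpoints
              if my_zone in e.get("hints", {}).get("forZones", [])]
    return hinted or endpoints   # no hint for my zone: use everything

eps = [
    {"ip": "10.244.7.8", "zone": "ap-south-1a",
     "hints": {"forZones": ["ap-south-1a"]}},
    {"ip": "10.244.9.4", "zone": "ap-south-1b",
     "hints": {"forZones": ["ap-south-1b"]}},
]
print([e["ip"] for e in usable_endpoints(eps, "ap-south-1a")])  # zone-local only
print([e["ip"] for e in usable_endpoints(eps, "ap-south-1c")])  # fallback: all
```

The fallback branch is what keeps the feature safe: a zone with no hinted backends degrades to ordinary cluster-wide load-balancing rather than failing closed.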
Reproduce this on your laptop
# Spin up a 3-node kind cluster, deploy a Service, watch the EndpointSlice
brew install kind kubectl # or your distro's package manager
kind create cluster --config <(cat <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes: [{role: control-plane}, {role: worker}, {role: worker}]
EOF
)
kubectl create deployment score-overlay --image=nginx --replicas=4
kubectl expose deployment score-overlay --port=80 --target-port=80
kubectl get endpointslice -l kubernetes.io/service-name=score-overlay -o yaml
# Now watch what happens during a rolling restart:
kubectl rollout restart deployment score-overlay &
kubectl get endpointslice -l kubernetes.io/service-name=score-overlay -w
# And inspect the kube-proxy iptables chain on a node:
docker exec -it kind-worker iptables -t nat -L KUBE-SERVICES -n | head
Where this leads next
The Service abstraction lifts pod IPs out of the application's mental model — the app developer writes http://score-overlay/health, not http://10.244.7.8:8080/health. But it does not solve every discovery problem. Client-side vs server-side discovery is the next question: even within Kubernetes, should the load-balancing decision happen in the kernel of the calling node (server-side, as kube-proxy does), in the calling client process (client-side, as a service mesh's sidecar does), or somewhere in between? Each choice changes the failure modes and the observability story.
The natural extension within the platform itself is the service mesh — Istio, Linkerd, Cilium — which replaces (or layers on top of) kube-proxy with a per-pod sidecar that owns the load-balancing decision in userspace, with much richer policy (mTLS, retries, circuit breaking, traffic shifting). The mesh's data plane is L7-aware in a way iptables can never be; the cost is a sidecar per pod, doubling the per-request CPU.
References
- "Service" — Kubernetes documentation — the canonical reference for Service types, selectors, and the lifecycle of an EndpointSlice.
- "EndpointSlices" — Kubernetes documentation — sharding rationale, the conditions field, and topology-aware hints.
- Mengnan Gong, "Scaling Kubernetes Networking with EndpointSlices" — Kubernetes blog 2020 — the design doc for why Endpoints became EndpointSlice; includes the 5,000-pod scaling numbers.
- Laura Lorenz et al., "Kubernetes 1.31: nftables proxy mode is generally available" — Kubernetes blog 2024 — the nftables data-plane proxy mode and benchmarks vs iptables/IPVS.
- Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes, "Borg, Omega, and Kubernetes" — ACM Queue 2016 — the lineage of the Service / Endpoints abstraction from Borg through to Kubernetes.
- Bowei Du, "How does the Kubernetes networking work?" — kccnceu2018 talk recording — a maintainer walking through kube-proxy's iptables rule generation in detail.
- Consul, etcd, ZooKeeper — internal companion. Reading them back-to-back makes the trade clear: Kubernetes is a coordination service (etcd) with a higher-level API (Service objects) and a per-node cache-and-rewrite agent (kube-proxy).
- DNS-based discovery — internal companion. CoreDNS turns a Service name into an answer; this chapter shows what backs that answer.