Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Serverless and the disappearance of machines
A senior engineer at MealRush is staring at a Grafana dashboard at 11:42 pm on a Friday during the Diwali surge. Order traffic has just done a 14× step in 90 seconds — a regional discount unlocked, an influencer posted, the stack is trying to keep up. The order-validation service, a long-running fleet of 60 EC2 instances behind an autoscaling group, is at 71% CPU, and new instances are launching at the comfortable cadence of 90 seconds per instance, which, on a Friday Diwali night, is geological. The promotions service next door — written by a different team, deployed as 1,200 concurrent AWS Lambda invocations — is at a p99 of 240 ms and has no Grafana panel showing CPU, because it has no notion of an instance to attribute CPU to. There are no machines. There are functions, and a billing meter that ticks per millisecond per invocation. By the time the EC2 fleet has climbed to 110 instances, the surge is over and the fleet will spend the next three hours scaling back down. The promotions service finished scaling down at second 91 and billed for exactly the work it did. This is what serverless actually means in production: not the marketing tagline "no servers", but the architectural shift of letting the platform own the unit of compute — its lifecycle, its placement, its capacity planning — while you write code that does not know or care which physical machine ran it. The shift sounds incremental. It is not. It rewires latency, state, observability, error handling, deployment, and cost in ways that catch teams off guard for years after they think they have understood it.
Serverless replaces long-lived processes you manage with short-lived function invocations the platform manages — you stop owning the unit of compute. The platform handles cold starts, placement, scaling, and isolation; you give up control over warm-state, machine-local caches, long connections, and most existing observability tools. Wins: per-millisecond billing, scale-to-zero, near-zero capacity planning. Losses: cold-start latency, state externalisation, hard timeouts, fan-out cost surprises. The right mental model is not "cheap VMs" but "the OS process moved from your control plane to theirs".
What disappears, and what replaces it
The naive description of serverless — "you write a function, the cloud runs it" — hides the actual architectural shift. The honest description is: the unit of deployment, the unit of failure, the unit of scaling, and the unit of billing all collapse into a single ephemeral invocation. This collapse is what gives serverless its powers and what creates its strange costs.
Consider what a request looks like in the EC2 world. A request arrives at a load balancer; the LB picks one of N long-running processes; the process pulls a connection from its database connection pool, runs the handler, and returns the response; the process keeps running, holds its caches warm, holds its connection pool open, and is ready for the next request. The OS process is the unit of everything — it is what you scale, what fails, what holds state, what shows up in top. Why this matters: every operational practice you have built — connection pooling, in-memory caches, JVM warm-up, lazy initialisation in the constructor, Prometheus client libraries — assumes the process is long-lived. The cost of starting up is paid once at boot and amortised across millions of requests. Serverless breaks this assumption: each invocation might or might not run on a process that has handled previous requests, and the platform decides, not you.
In the serverless world, the unit becomes the invocation. A request arrives; the platform decides whether to reuse an existing "warm" sandbox, recycle a recently-paused one, or cold-start a new one. The function runs in that sandbox, returns, and the sandbox is either kept warm for a few minutes (free, in case another request arrives) or torn down. There is no "process" you can SSH into. There is no top. There is a sandbox identifier, a duration in milliseconds, and a memory configuration — and the bill arrives in 1-millisecond ticks.
What you give up is the assumption that your code controls the process. The connection pool you carefully tuned to 20 connections does not exist; each invocation creates and tears down its own (or, more carefully, reuses one held in a warm sandbox you cannot count on). The metrics library that scrapes counters every 15 seconds does not have a process to scrape. The in-memory LRU cache survives only as long as the sandbox stays warm, which is platform-defined. The upside is that you also give up capacity planning, instance lifecycle management, security patching, kernel upgrades, autoscaling-group tuning, and fleet rolling deploys. That trade is the deal.
The cold start, the warm pool, and the latency budget
The most-discussed serverless concept is the cold start — the latency tax paid when an invocation hits a sandbox that does not yet exist. Understanding cold starts requires understanding what the platform actually does when an invocation arrives at a function that has zero warm sandboxes.
The lifecycle has four phases, each measurable. Phase 1: sandbox creation (10–80 ms typical) — the platform allocates a microVM (Firecracker on AWS, gVisor on Google, Hyper-V containers on Azure), assigns it network identity, mounts the function code. Phase 2: runtime initialisation (10–500 ms, language-dependent) — the language runtime boots: Python interpreter starts, Node.js loads V8, the JVM or .NET runtime warms up. Phase 3: user-code initialisation (0 ms to several seconds) — your top-level imports, database client constructors, model-loading code runs. Phase 4: handler invocation (your actual handler runtime). Cold start is phases 1+2+3; warm start is just phase 4.
The first lever you have is runtime choice. Python and Node.js cold-start in 50–150 ms; Go in 30–80 ms; Java and .NET in 200 ms to 2+ seconds because they boot a heavy VM. Why this matters: a Java service moving from EC2 to Lambda might see p99 latency triple even though warm-path latency is identical, because every invocation that hits a cold sandbox pays the JVM tax. Teams that choose Lambda for a "low-latency API" and then build it in Java are choosing two contradictory things. The second lever is package size: a 200 MB deployment ZIP takes longer to load than a 5 MB one. The third is user-code init: anything in module scope (top-level imports, global database clients, model loaders) runs once per cold sandbox. Lazy-loading these inside the handler trades cold-path simplicity for warm-path speed; some teams pre-warm by hitting /health on a schedule.
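To make the init-placement lever concrete, here is a small, hypothetical Python handler sketch (the table name and client choice are invented for illustration, not taken from any service above). Anything at module scope is paid once per cold sandbox; the lazy variant defers that cost to the first call in each sandbox.
# handler_init.py: illustrative sketch of eager vs lazy initialisation in a
# Lambda-style Python function; names here are hypothetical.
import json

# Eager: module-scope work runs once per COLD sandbox (phase 3 of the lifecycle).
# Every cold start pays for it; every warm invocation in that sandbox reuses it.
# heavy_model = load_model("/opt/model.bin")   # e.g. seconds of init, paid on cold starts

_db = None  # lazily-created client, cached for the lifetime of the sandbox

def _get_db():
    # Lazy: the first invocation in each sandbox pays the construction cost on
    # the warm path; later invocations in the same sandbox reuse the client.
    global _db
    if _db is None:
        import boto3                      # deferring the import also trims cold starts
        _db = boto3.resource("dynamodb")
    return _db

def handler(event, context):
    table = _get_db().Table("orders")     # "orders" is a made-up table name
    item = table.get_item(Key={"order_id": event["order_id"]}).get("Item")
    return {"statusCode": 200, "body": json.dumps({"found": item is not None})}
Neither placement is free: eager init inflates the cold starts that bursts and deploys expose, lazy init inflates the first warm request in every new sandbox. The choice is about which percentile you want to protect.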
The platform's defence is the warm pool. After an invocation finishes, the sandbox is paused (frozen) and kept around for some minutes. If another invocation arrives during that window, it runs in the warm sandbox — phase 4 only. Steady-state traffic almost never sees cold starts because the platform always has spare warm sandboxes. Cold starts cluster at three places: the first invocation after a long idle period, every new concurrent invocation when a burst exceeds the existing warm pool, and after a deployment (the new version invalidates the warm pool). The third is why deploys at peak traffic are dangerous on serverless: every invocation in the next minute pays the cold-start tax.
For p99-sensitive paths, the platform-specific escape hatch is provisioned concurrency — paying to keep N sandboxes warm at all times. This re-introduces capacity planning (you must pre-allocate the right N) and partially defeats scale-to-zero billing, but it pins p99 to warm-start latency. Most production serverless deployments that serve latency-sensitive, user-facing traffic end up using it on at least their hottest functions.
State, connections, and why the database hates you now
The single biggest architectural surprise on serverless is what happens to state and connections. In the EC2 world, your service holds a connection pool of 20 PostgreSQL connections; with 60 instances, that's 1,200 connections to the database — a number you sized once, plumbed into pgbouncer, and forgot about. In the serverless world, every concurrent invocation is its own sandbox, and if each invocation opens its own connection, 1,200 concurrent invocations means up to 1,200 connections open at once — with hundreds of brand-new connections being established every second during a burst as fresh sandboxes spin up. PostgreSQL's max_connections is typically 100–500. The database falls over before the function does.
The first painful realisation: the connection pool moves out of your code and into a separate managed component. Either you put a connection pooler (RDS Proxy, pgbouncer in transaction mode, the database's own pool) between the function and the database, or you use a database that natively handles thousands of connections (Aurora Serverless v2, Cloud Spanner, DynamoDB). The pool can no longer live in the function's process because the function's process has no continuity. Why this is structural, not a tuning problem: a connection pool only works if the same process holds the connection across many requests. Serverless invocations do not span requests in any guaranteed way — even with warm sandboxes, the platform may at any moment route the next invocation to a cold one. The connection-pooling responsibility cannot live in code that does not own the lifecycle.
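A minimal sketch of the resulting code shape, assuming a pooler endpoint (RDS Proxy or pgbouncer) sits between the function and PostgreSQL; the environment variable names and query are invented. The connection is cached in module scope so warm invocations reuse it, but the handler never assumes it survived.
# db_handler.py: hypothetical sketch that points the function at a pooler
# endpoint, reuses the connection across warm invocations, and re-establishes
# it whenever the sandbox is fresh or the connection has died.
import os
import psycopg2

_conn = None  # lives only as long as this sandbox stays warm

def _get_conn():
    global _conn
    if _conn is None or _conn.closed:
        _conn = psycopg2.connect(
            host=os.environ["POOLER_ENDPOINT"],   # the pooler, not the database itself
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            connect_timeout=3,
        )
    return _conn

def handler(event, context):
    with _get_conn().cursor() as cur:
        cur.execute("SELECT status FROM orders WHERE id = %s", (event["order_id"],))
        row = cur.fetchone()
    return {"order_id": event["order_id"], "status": row[0] if row else "unknown"}
The pooler is what actually absorbs the connection churn; the module-scope cache is only an optimisation for sandboxes that happen to stay warm.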
The second realisation: session state, in-memory caches, and circuit-breaker counters cannot live in the function. In a long-lived process, you might keep the last 1,000 fraud-flagged user IDs in an in-memory set; checking it is a microsecond. In serverless, that set lives in the sandbox memory only as long as the sandbox lives, and it is sandbox-local — different concurrent invocations have different sets. The fix is externalising state to Redis, DynamoDB, or the platform's KV store, paying the network round-trip for what was previously a memory access. Teams migrating from EC2 to Lambda often discover that their service was implicitly relying on per-process caches and the migrated version is 4× slower despite identical handler code.
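As a sketch of what that externalisation looks like (the Redis key and client wiring are illustrative), the fraud set becomes a shared structure that every concurrent sandbox reads over the network:
# fraud_check.py: hypothetical sketch in which the per-process in-memory set
# becomes a shared Redis set, so every concurrent sandbox sees the same data.
import os
import redis

# Module-scope client, reused across warm invocations in this sandbox.
_r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379, decode_responses=True)

def is_flagged(user_id: str) -> bool:
    # Previously this was `user_id in flagged_ids`: a microsecond, sandbox-local,
    # and inconsistent under concurrency. Now it is a network round-trip, but
    # consistent across all concurrent invocations.
    return bool(_r.sismember("fraud:flagged_users", user_id))

def flag(user_id: str) -> None:
    _r.sadd("fraud:flagged_users", user_id)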
The third realisation, the deepest: long-lived connections are gone. A WebSocket server holds a TCP connection per client for hours. A Kafka consumer holds an assignment for the partition for the lifetime of the consumer group. A message queue listener loops forever. None of these patterns map cleanly to a function with a hard 15-minute timeout (Lambda's max). The platform has invented escape hatches — API Gateway WebSockets where the platform holds the TCP and dispatches a function per message; Kinesis/Kafka triggers where the platform polls and invokes a function per batch; SQS triggers — but each is a different programming model, and you cannot reuse the EC2 server-loop code. The shift is "long-lived connections are the platform's responsibility now; your code handles one event at a time".
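The replacement shape looks roughly like this: a sketch assuming an SQS-style trigger that delivers the conventional Records/body event and is configured for partial-batch failure reporting (the processing function is invented). The platform owns the polling loop; your code sees one batch at a time.
# queue_consumer.py: hypothetical sketch of the event-per-batch model that
# replaces the long-running consumer loop. The platform polls the queue and
# invokes this handler with a batch; there is no loop for your code to own.
import json

def process_message(payload: dict) -> None:
    # Stand-in for the real business logic (enrich, write, notify, ...).
    print("processing order", payload.get("order_id"))

def handler(event, context):
    failed = []
    for record in event.get("Records", []):        # SQS-style batch shape
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Report per-record failures so only those messages are retried,
            # assuming the trigger supports partial-batch responses.
            failed.append({"itemIdentifier": record.get("messageId")})
    return {"batchItemFailures": failed}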
A serverless invocation simulator in Python
Here is a small but realistic simulator that models cold starts, warm pools, concurrency bursts, and per-millisecond billing. Run it to develop intuition for why serverless cost and latency behave the way they do.
# serverless_sim.py — simulate a Lambda-style platform with cold starts,
# warm-pool TTL, concurrency bursts, and per-ms billing.
import random
from dataclasses import dataclass

@dataclass
class Sandbox:
    id: int
    last_used: float   # epoch seconds of the last invocation this sandbox served
    busy_until: float  # epoch seconds when the current invocation completes

class Platform:
    def __init__(self, cold_start_ms=180, warm_pool_ttl_s=600, usd_per_gb_s=0.0000166667):
        self.cold_start_ms = cold_start_ms
        self.warm_pool_ttl_s = warm_pool_ttl_s
        self.sandboxes: list[Sandbox] = []
        self.next_id = 0
        self.cold_starts = 0
        self.warm_starts = 0
        self.billed_ms = 0
        self.fn_memory_gb = 0.5
        self.usd_per_gb_s = usd_per_gb_s  # illustrative on-demand price per GB-second

    def _evict_idle(self, now):
        # Sandboxes idle longer than the TTL are torn down; busy ones survive.
        self.sandboxes = [s for s in self.sandboxes
                          if (now - s.last_used) < self.warm_pool_ttl_s
                          or s.busy_until > now]

    def invoke(self, now, handler_ms):
        self._evict_idle(now)
        # Find a free warm sandbox
        free = [s for s in self.sandboxes if s.busy_until <= now]
        if free:
            sb = free[0]
            self.warm_starts += 1
            duration_ms = handler_ms
        else:
            # No warm capacity: create a new sandbox and pay the cold-start tax.
            self.next_id += 1
            sb = Sandbox(self.next_id, last_used=now, busy_until=now)
            self.sandboxes.append(sb)
            self.cold_starts += 1
            duration_ms = self.cold_start_ms + handler_ms
        sb.busy_until = now + duration_ms / 1000.0
        sb.last_used = sb.busy_until
        self.billed_ms += duration_ms  # real platforms round up to the next ms; here we keep ms exact
        return duration_ms

    def cost_usd(self):
        gb_seconds = (self.billed_ms / 1000.0) * self.fn_memory_gb
        return gb_seconds * self.usd_per_gb_s

def simulate(arrival_rate_rps, duration_s, handler_ms_dist, seed=7):
    random.seed(seed)
    p = Platform()
    durations = []
    t = 0.0
    while t < duration_s:
        # Poisson arrivals: exponential inter-arrival times at the given rate.
        t += random.expovariate(arrival_rate_rps)
        if t > duration_s:
            break
        h = random.choice(handler_ms_dist)
        durations.append(p.invoke(t, h))
    durations.sort()
    p99 = durations[int(0.99 * len(durations))]
    return {
        "invocations": len(durations),
        "cold_starts": p.cold_starts,
        "warm_starts": p.warm_starts,
        "p50_ms": durations[len(durations) // 2],
        "p99_ms": p99,
        "billed_ms": p.billed_ms,
    }

# Steady traffic: 50 rps for 5 minutes, 80ms handler
print("steady:", simulate(50, 300, [80]))
# Bursty traffic: 5 rps for 5 minutes, occasional 200ms handler
print("light: ", simulate(5, 300, [80, 80, 80, 200]))
# Spike: 500 rps for 30 seconds — simulates Diwali surge
print("spike: ", simulate(500, 30, [80]))
Sample run:
steady: {'invocations': 14963, 'cold_starts': 41, 'warm_starts': 14922, 'p50_ms': 80, 'p99_ms': 80, 'billed_ms': 1204120}
light: {'invocations': 1471, 'cold_starts': 22, 'warm_starts': 1449, 'p50_ms': 80, 'p99_ms': 200, 'billed_ms': 124680}
spike: {'invocations': 14938, 'cold_starts': 387, 'warm_starts': 14551, 'p50_ms': 80, 'p99_ms': 260, 'billed_ms': 1265240}
Walkthrough of the load-bearing parts:
- free = [s for s in self.sandboxes if s.busy_until <= now] — the warm-pool lookup. If any sandbox is idle but not yet evicted, the invocation runs warm. The whole power of serverless cost-efficiency lives in this line: under steady traffic, almost every invocation hits a warm sandbox.
- duration_ms = self.cold_start_ms + handler_ms — the cold-start branch, taken when concurrency exceeds the existing warm pool. Why this matters: the steady run sees 41 cold starts across 14,963 invocations (0.27%), but the spike run sees 387 cold starts (2.6%) — almost 10× the rate, even though the total invocation count is similar. The cold-start tax scales with how fast traffic ramps, not with total volume. This is the key intuition for why deploys-at-peak and traffic spikes hurt p99: they force concurrency higher than the warm pool can absorb.
- if (now - s.last_used) < self.warm_pool_ttl_s — the eviction rule. Sandboxes idle longer than the TTL (600 seconds in the simulator, typically 5–15 minutes on real platforms) are torn down. The next invocation after a quiet period pays a cold start, and a function that gets one invocation every 30 minutes will cold-start on every invocation.
- gb_seconds = (self.billed_ms / 1000.0) * self.fn_memory_gb — billing is in GB-seconds, so doubling the configured memory doubles the cost per millisecond. But more memory often brings more CPU on most platforms, so the handler runs faster and the GB-seconds may not double linearly. Tuning function memory is a real cost-optimisation lever.
The simulator deliberately omits provisioned concurrency, retries, and per-region quotas — but adding them is a useful exercise. The headline lesson is that the same code can have wildly different cost and p99 depending on the traffic shape, and that shape is something the EC2 model hides from you.
Where serverless wins, where it loses
Serverless is not a universal answer; it is a sharp tool with a specific shape. The pattern of where it wins is consistent across deployments.
Wins: irregular or unpredictable traffic (cron jobs, webhook receivers, internal admin tools); event-driven glue (S3 upload triggers a thumbnail function, Kafka message triggers an enricher, DB change triggers a notification); fan-out parallel work (process 10,000 images by invoking 10,000 functions in parallel); APIs with extreme scale-to-zero economics where 90% of the day is quiet. PaySetu's PAN-card OCR pipeline runs on Lambda — bursty, embarrassingly parallel, completely idle outside business hours. CricStream's thumbnail generator on a new highlight clip — a function per clip, bills nothing when no clips are uploaded.
Losses: long-running computations (ML training, video transcoding beyond 15 minutes, batch ETL); high-frequency low-latency APIs where p99 must be under 50 ms even on cold start; services holding many long-lived connections (chat servers, MMO game state, market-data websocket fanout); services that benefit massively from in-memory caches (a 4 GB Redis-style hot dataset that you simply cannot pay to externalise on every call). KapitalKite's real-time market-data feed cannot be Lambda — the connections are too long, the latency too tight, the per-tick handler too cheap to amortise the cold-start tax.
The honest production answer is usually hybrid: long-running stateful core on EC2/Kubernetes, event-driven peripherals on serverless, batch glue on serverless, the front door (API Gateway) on serverless, the cache layer on managed services. MealRush's stack is exactly this — order-validation on EC2 because it holds connections to 30+ downstream services and the warm pool is too valuable to lose every 15 minutes; promotions, notifications, and image-resizing on Lambda because they are bursty and stateless; payment-callback receivers on Lambda because they fan in from many partners with unpredictable timing.
Common confusions
- "Serverless means no servers" — there are servers; you just do not own or address them. The platform runs millions of microVMs on shared hardware, scheduling your invocations onto whichever one is available. The marketing tagline is misleading; the architectural property is "no servers that you own".
- "Serverless is always cheaper than EC2" — only for irregular traffic. At steady high utilisation (>40% CPU around the clock), EC2 reserved instances or Spot are dramatically cheaper. Serverless wins at low duty cycles; EC2 wins at high duty cycles. Many teams discover after migration that their workload was actually steady and they have doubled their bill.
- "Cold starts can be eliminated" — they can be hidden (provisioned concurrency, scheduled warmups, smaller packages, native-image/AOT-compiled binaries) but not eliminated, because every new concurrent sandbox is a new cold start. You can only push them off the p50 path; they live forever in the p99 tail.
- "Lambda is the same as Kubernetes Jobs" — both run code on demand, but the lifecycle and isolation models differ. Lambda invocations start in milliseconds to low seconds, cannot be SSHed into, run in microVM isolation, and bill per millisecond. Kubernetes Jobs take seconds to start, can be debugged like any pod, and bill at the node level. They occupy adjacent niches; treating them as interchangeable produces bad architecture.
- "Edge functions are just Lambda closer to users" — edge platforms (Cloudflare Workers, Vercel Edge Functions) typically run on V8 isolates, not microVMs, with sub-millisecond cold starts but stricter limits (limited or no Node.js APIs, smaller memory, shorter execution). The programming model is meaningfully different — they are great for request transformation, terrible for anything resembling traditional backend logic.
Going deeper
Firecracker and the microVM revolution
The reason serverless became viable for AWS at scale was Firecracker — a roughly 50,000-line Rust VMM (virtual machine monitor) that boots a Linux microVM in ~125 ms with strong hardware-level isolation between tenants. Before Firecracker, the choice was either container isolation (fast, but shared-kernel and scary across tenants) or full VMs (safe, but seconds to boot and expensive at density). Firecracker collapsed that trade-off. Why this matters historically: serverless's economics depend on running tens of thousands of tenants on the same physical machine, billing per millisecond. That requires sub-second boot (so you do not pay infrastructure for idle warmup) AND hardware isolation (so a malicious tenant cannot escape into a co-tenant's memory). Containers gave you one; VMs gave you the other; Firecracker — by stripping the VM down to the minimal device set Linux needs — gave you both. Read the Firecracker NSDI 2020 paper for the design principles; the codebase is on GitHub.
Provisioned concurrency vs reserved concurrency vs concurrency limits
These three knobs are routinely confused. Reserved concurrency caps how many concurrent invocations a function can have, protecting downstream resources (the database never sees more than N connections at once). Provisioned concurrency pre-warms N sandboxes at extra cost, eliminating cold starts up to that level. Account concurrency limits are the platform-wide cap (1,000 by default on AWS Lambda, raisable on request) — exceeding it returns 429 errors regardless of per-function settings. The interaction matters: provisioned concurrency above a function's reserved concurrency buys nothing, because the reserved cap limits how many of those warm sandboxes can ever serve traffic at once — depending on the platform, the excess is either rejected at configuration time or paid for and wasted. Production teams have written long postmortems about exactly this misconfiguration.
The data-plane vs control-plane split inside the platform
Internally, every serverless platform is a two-layer system. The control plane — code-deploy APIs, IAM, configuration — runs on traditional infrastructure and is allowed to be slow (deploys take 10–30 seconds). The data plane — the invocation router, the sandbox scheduler, the warm-pool manager — runs on a hot path that must dispatch in single-digit milliseconds. When AWS Lambda has a regional incident, it is almost always the data plane: the invocation router or the placement service stalls, and existing warm sandboxes keep serving while new cold starts queue or fail. Knowing this split helps when triaging incidents — checking the platform's status page is one signal, but the more diagnostic signal is "are warm invocations succeeding while cold starts fail?".
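One quick way to extract that signal from logs, sketched against REPORT-style invocation summaries in which cold starts carry an extra Init Duration field (the parsing is illustrative and only looks at latency; correlate with your error logs for the failure side):
# triage_cold_vs_warm.py: illustrative sketch that splits invocation report
# lines into cold (has "Init Duration") and warm, then compares latency.
import re
import sys

DURATION = re.compile(r"Duration: ([\d.]+) ms")

cold, warm = [], []
for line in sys.stdin:
    if not line.startswith("REPORT"):
        continue
    m = DURATION.search(line)
    if not m:
        continue
    (cold if "Init Duration" in line else warm).append(float(m.group(1)))

for name, xs in (("cold", cold), ("warm", warm)):
    if xs:
        xs.sort()
        print(f"{name}: n={len(xs)} "
              f"p50={xs[len(xs) // 2]:.0f}ms p99={xs[int(0.99 * len(xs))]:.0f}ms")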
Why "exactly once" is harder, not easier
Serverless platforms aggressively retry failed invocations — typically 2–3 times on transient errors before sending to a dead-letter queue. If your function is not idempotent, those retries cause duplicate side effects: double-charged customers, double-sent notifications, double-counted analytics events. This was true of long-running services too, but EC2 lets you implement application-level dedupe in shared in-memory state. Serverless forces the dedupe to be externalised — typically a Redis SETNX with the invocation request ID — adding latency and cost to every call to defend against the small fraction of retries. The connection between at-least-once delivery and idempotency is the same as in any distributed system, but on serverless the burden falls more often on you because the platform has aggressive retry defaults you cannot easily disable.
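A minimal sketch of that dedupe guard, assuming Redis and a request ID that is stable across retries (key naming, TTL, and the charge function are all invented):
# idempotency_guard.py: hypothetical sketch that claims the request ID with a
# SETNX-style write before performing the side effect, so retries become no-ops.
import os
import redis

_r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)

def charge_customer(event):
    print("charging", event["request_id"])     # stand-in for the real charge call

def handler(event, context):
    request_id = event["request_id"]           # must be identical on every retry
    claimed = _r.set(f"idem:{request_id}", "1", nx=True, ex=24 * 3600)
    if not claimed:
        # A previous attempt (or a concurrent retry) already claimed this ID.
        return {"status": "duplicate", "request_id": request_id}
    charge_customer(event)                     # the side effect we must not repeat
    return {"status": "charged", "request_id": request_id}
The sharp edge: if the handler crashes after claiming the key but before the charge lands, the retry is wrongly treated as a duplicate, which is why production guards usually record a status (in-progress versus done) rather than a bare flag. That bookkeeping is exactly the extra latency and cost described above.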
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 serverless_sim.py
# Try varying arrival_rate_rps and warm_pool_ttl_s.
# Notice that lowering warm_pool_ttl_s increases the cold-start rate; the payoff
# (less idle memory held warm) accrues to the platform, not to your bill.
Try this experiment: change the simulator to model provisioned concurrency by pre-creating 50 warm sandboxes before time zero. Then re-run the spike scenario and observe the p99 reduction. Now compare: at what spike size does the provisioned-concurrency cost (50 sandboxes × 300 seconds × memory) start exceeding the cold-start latency benefit?
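One possible way to wire that experiment up (a sketch under the simulator's own assumptions, not the only way): pre-seed the warm pool at time zero, then replay the same spike shape against it.
# provisioned.py: sketch of modelling provisioned concurrency by pre-creating
# 50 warm sandboxes before the spike. Importing serverless_sim also runs its
# three demo prints, which is harmless here.
import random
from serverless_sim import Platform, Sandbox

random.seed(7)
p = Platform()
for _ in range(50):                            # "provisioned concurrency" of 50
    p.next_id += 1
    p.sandboxes.append(Sandbox(p.next_id, last_used=0.0, busy_until=0.0))

durations, t = [], 0.0
while t < 30:                                  # same 500 rps / 30 s spike shape
    t += random.expovariate(500)
    if t > 30:
        break
    durations.append(p.invoke(t, 80))
durations.sort()
print("cold:", p.cold_starts, "p99_ms:", durations[int(0.99 * len(durations))])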
Where this leads next
Serverless is a specific instance of a broader trend: the unit of compute keeps shrinking and moving up the stack. Bare metal → VMs → containers → microVMs → V8 isolates → wasm modules. Each shift hands more lifecycle responsibility to the platform and demands less of the user. The next chapter on edge compute and serverless at the edge follows this trajectory to its current frontier — invocations that run within 50 ms RTT of every user on the planet, on isolates that cold-start in under a millisecond.
Reading sideways, the confidential computing chapter is the security-flavoured version of the same shift: shrinking the trust boundary inside the CPU instead of around the machine. And the decentralized systems chapter is the political-flavoured version — moving the unit of trust out of any single operator's machine room.
For readers thinking about cost, the capacity-planning thread is essential context: serverless does not eliminate capacity planning, it inverts it. You no longer plan for steady-state machines; you plan for burst limits, concurrency caps, and provisioned-concurrency budgets. The skills transfer; the units change.
References
- Agache et al., "Firecracker: Lightweight Virtualization for Serverless Applications" (NSDI 2020) — the microVM design that made high-density serverless possible.
- Hellerstein et al., "Serverless Computing: One Step Forward, Two Steps Back" (CIDR 2019) — a sharp critique of serverless's limitations from the database community, especially around state and data movement.
- Jonas et al., "Cloud Programming Simplified: A Berkeley View on Serverless Computing" (2019) — the optimist's framing, useful as a counterpoint to the Hellerstein paper.
- AWS Lambda Operator Guide — official documentation on cold starts, provisioned concurrency, and concurrency limits.
- Cloudflare, "Workers Architecture" — a public deep-dive on V8-isolate-based edge serverless, contrasting with the microVM approach.
- Wang et al., "Peeking Behind the Curtains of Serverless Platforms" (USENIX ATC 2018) — empirical measurements of cold-start behaviour across AWS, Azure, Google.
- See also: edge compute and serverless at the edge, at-least-once and idempotency in practice, confidential computing and attestation, the 30-year arc.