The cost of TLS (crypto and memory)

Aditi runs the customer-facing API gateway at PhonePe — a fleet of nginx instances fronting the UPI payment service, terminating TLS for ~120 million handshakes per day. On a Tuesday in April, the platform team rolls a config change that disables TLS session resumption "to simplify load-balancer behaviour during a planned migration." For 90 minutes, request p99 climbs from 38 ms to 240 ms, CPU on the gateway fleet jumps from 22% to 71%, and the autoscaler triples the instance count before the on-call engineer reverts the change. Nothing about the application changed. Nothing about the network changed. The only difference: every TCP connection now had to do a full TLS handshake instead of resuming a cached session, and the full handshake costs an RSA decryption (~2 ms on this CPU at 2048-bit) versus the resumed handshake's symmetric-key derivation (~50 µs). Multiplied across the 14,000 new connections per second the fleet sees at peak, the CPU bill went from 700 ms/sec to 28,000 ms/sec — 28 cores' worth of pure RSA. The memory bill grew at the same rate: each in-flight handshake keeps an SSL struct, a pair of buffers, and the peer's certificate chain pinned in RAM, totalling ~64 KB per handshake. The cost of TLS is not the cost of "encryption"; it is the cost of six distinct things that the handshake and steady-state cipher do, and the configuration choices that determine which of them you pay for at peak load decide whether your gateway survives the next traffic surge.

TLS costs six things in roughly this order: an asymmetric key operation per fresh handshake (~1-3 ms RSA-2048 decrypt, ~150 µs ECDSA-P256 sign), a key-derivation and certificate-chain-validation step (~100-300 µs), a symmetric AEAD cipher per record at steady state (~0.3-1.5 ns/byte with AES-NI hardware, 6-15 ns/byte for software ciphers), and three memory costs that compound — the per-handshake working set (~64 KB peak), the per-connection session state (~16 KB held for the duration of the connection), and the session-cache or ticket entries (~256 B per resumable session, kept for hours). The fix for almost every TLS performance problem is the same: enable session resumption, prefer ECDSA over RSA, terminate at a single layer, and keep AES-NI on your CPU shopping list. Each is a configuration line, not a redesign.
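
As a concrete reference point, those configuration lines compress into a handful of nginx directives. A minimal sketch, assuming nginx 1.19+ built against OpenSSL 1.1.1+; the certificate paths and backend name are illustrative:

server {
    listen 443 ssl http2;
    ssl_certificate           /etc/nginx/certs/ecdsa-p256.pem;   # ECDSA over RSA
    ssl_certificate_key       /etc/nginx/certs/ecdsa-p256.key;
    ssl_protocols             TLSv1.2 TLSv1.3;      # prefer 1.3; keep 1.2 for pre-2018 clients
    ssl_session_cache         shared:SSL:50m;       # session-id resumption (TLS 1.2 clients)
    ssl_session_tickets       on;                   # stateless ticket resumption
    ssl_session_timeout       1h;
    ssl_prefer_server_ciphers off;                  # AES-less mobile clients can pick ChaCha20
    location / { proxy_pass http://app_backend; }   # terminate once; plaintext inside the VPC
}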

What you actually pay for when a connection turns into a TLS connection

A TLS connection is not a single cost; it is a series of operations that the client and server perform together, with very different cost shapes at each phase. Understanding which phase dominates at any given moment is the first step in deciding whether your fix is "more cores", "session resumption", "ECDSA certificates", "AEAD ciphers", or "terminate TLS one layer down."

Phase one — the asymmetric handshake. When a client connects to your server for the first time, the two peers negotiate a shared symmetric key without ever sending it in the clear. In TLS 1.2 with RSA key exchange, the client picks a random pre-master secret, encrypts it with the server's public RSA key, and sends the ciphertext; the server decrypts it with its private key. The RSA decrypt is the expensive step — about 2.0-2.5 ms on a modern Intel Xeon for a 2048-bit key, ~7 ms for a 3072-bit key, and ~25 ms for a 4096-bit key. In TLS 1.2 with ECDHE (ephemeral elliptic-curve Diffie-Hellman, the modern default) and an ECDSA certificate, the server signs an ephemeral public key with its ECDSA P-256 private key — a ~150 µs operation, more than 10× cheaper than RSA-2048 decrypt. In TLS 1.3, the same asymmetric step happens earlier in the handshake (a round trip is saved because the client sends its ephemeral key share alongside the ClientHello), but the cost shape is the same: one asymmetric operation per fresh handshake, dominating CPU when handshake rate is high.
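
A quick way to feel the phase-one gap without a network is to time the two asymmetric operations directly. A sketch assuming the third-party cryptography package (pip install cryptography); absolute numbers vary by CPU:

# time_asymmetric.py — RSA-2048 decrypt vs ECDSA-P256 sign, the phase-one cost
import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, ec, padding

rsa_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
ec_key  = ec.generate_private_key(ec.SECP256R1())
# TLS 1.2 RSA key exchange uses PKCS#1 v1.5 padding, so that is what gets timed here
ct = rsa_key.public_key().encrypt(b"illustrative pre-master secret", padding.PKCS1v15())

def per_op(label, fn, n=200):
    t0 = time.perf_counter()
    for _ in range(n): fn()
    print(f"{label:20s} {(time.perf_counter() - t0) / n * 1e6:8.1f} µs/op")

per_op("RSA-2048 decrypt", lambda: rsa_key.decrypt(ct, padding.PKCS1v15()))
per_op("ECDSA-P256 sign",  lambda: ec_key.sign(b"ephemeral key share", ec.ECDSA(hashes.SHA256())))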

Phase two — key derivation, certificate validation. Both peers must derive the symmetric session keys from the shared secret using a key derivation function (HKDF in TLS 1.3, PRF in TLS 1.2). The client must validate the server's certificate chain — verify each signature in the chain, check the hostname against the SAN, check the OCSP or CRL status if stapling is configured. Key derivation is cheap (~10 µs); chain validation costs 50-300 µs depending on chain depth and whether OCSP must be fetched.
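
Phase two's derivation step is small enough to reimplement and time in stdlib Python. A sketch of RFC 5869 HKDF; TLS 1.3 layers a structured labelling scheme (HKDF-Expand-Label) on top of exactly this construction:

# hkdf_timing.py — the phase-two derivation cost, stdlib only
import hashlib, hmac, time

def hkdf_extract(salt, ikm):
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk, info, length):
    out, t, i = b"", b"", 1
    while len(out) < length:
        t = hmac.new(prk, t + info + bytes([i]), hashlib.sha256).digest()
        out += t; i += 1
    return out[:length]

t0 = time.perf_counter()
for _ in range(10_000):
    prk  = hkdf_extract(b"\x00" * 32, b"shared-ecdhe-secret")
    keys = hkdf_expand(prk, b"illustrative label", 64)   # client + server traffic keys
print(f"{(time.perf_counter() - t0) / 10_000 * 1e6:.1f} µs per derivation")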

Phase three — the symmetric cipher at steady state. Once the handshake completes, every byte of application data is encrypted with an AEAD (authenticated encryption with associated data) cipher — AES-128-GCM and AES-256-GCM dominate where AES-NI hardware is available, ChaCha20-Poly1305 dominates on ARM (mobile clients) or older x86 without AES-NI. With AES-NI, the raw AES-128-GCM cipher costs ~0.2-0.3 ns/byte on a Skylake-X core at 2.5 GHz (closer to 1-1.5 ns/byte once TLS record framing and library overhead are added) — at line-rate 10 Gbps (1.25 GB/s), the raw cipher alone is ~30-40% of one core. Without AES-NI, AES-128-GCM costs ~8-12 ns/byte and saturates a core at ~100 MB/s. ChaCha20-Poly1305 in software is ~6-8 ns/byte, which is why it became the preferred cipher for mobile devices before ARMv8 added AES instructions.
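
The phase-three gap is easy to reproduce in isolation, with no sockets in the way. A sketch assuming the third-party cryptography package; openssl speed -evp gives the same shape with less overhead:

# aead_throughput.py — AES-128-GCM vs ChaCha20-Poly1305, cipher cost only
import os, time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM, ChaCha20Poly1305

buf = os.urandom(1 << 20)      # 1 MiB of plaintext
nonce = os.urandom(12)         # nonce reuse is fine for a benchmark, never in production
for name, aead in [("AES-128-GCM",       AESGCM(AESGCM.generate_key(bit_length=128))),
                   ("ChaCha20-Poly1305", ChaCha20Poly1305(ChaCha20Poly1305.generate_key()))]:
    t0 = time.perf_counter()
    for _ in range(64):        # 64 MiB total
        aead.encrypt(nonce, buf, None)
    dt = time.perf_counter() - t0
    print(f"{name:20s} {64 / dt:7.1f} MiB/s  → {dt / (64 * 2**20) * 1e9:5.2f} ns/byte")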

Phase four — the per-handshake memory. While a handshake is in flight, each peer holds a working set: the OpenSSL SSL struct (~12 KB), receive and transmit buffers (16 KB each by default in OpenSSL 3.x), the peer's certificate chain decoded into memory (~8-32 KB depending on chain length), the session ticket key context if active (~4 KB), and various scratch buffers for the handshake messages. Peak per-handshake memory on the server is typically 48-72 KB. At a fleet handling 50,000 concurrent in-flight handshakes, that is 2.4-3.6 GB of pure handshake working set — and this is in addition to the per-connection state, the kernel socket buffers, and the application's own per-request memory.

Phase five — the per-connection steady-state memory. After the handshake completes, the peer's certificate is no longer needed (in most configurations), and the working buffers can shrink. But the SSL struct itself, the session keys, the AEAD nonce counters, and the receive/transmit windows remain — typically 12-20 KB per connection for the duration. A server holding 200,000 long-lived TLS connections (a typical mid-tier API gateway during peak hours) is paying 2.4-4 GB of resident memory just for TLS state, before counting the application's own per-connection data structures.

Phase six — the session cache and tickets. TLS resumption — both server-side session-id caches and client-side session tickets — exists specifically to avoid paying the asymmetric handshake on every connection from a returning client. A session-cache entry is small (~256-512 bytes, holding the master secret, cipher choice, and timestamps) and stays in the cache for the configured timeout (typically 5-60 minutes). At a hit rate of 90% (typical for a consumer-facing service whose clients reconnect frequently), the cache absorbs 90% of the asymmetric-handshake cost, turning the per-connection CPU bill from 2 ms to 50 µs — the 40× speedup that makes TLS at scale feasible. The cost of the cache itself is the memory it occupies (256 B × N entries, often 256 MB for 1M sessions) plus the lookup time per handshake (~1 µs hash-table probe).

[Figure: Six phases of TLS cost — CPU and memory. A diagram showing the six phases: (1) the asymmetric handshake (RSA-2048 decrypt ~2 ms, ECDSA-P256 sign ~150 µs), the largest CPU cost of a fresh handshake; (2) key derivation and certificate-chain validation, 100-300 µs; (3) the steady-state symmetric AEAD, ~1 ns/byte with AES-NI and 8-12 ns/byte without; (4) the per-handshake memory working set, 48-72 KB, held only during the handshake; (5) the per-connection steady-state memory, 12-20 KB, held for the connection's lifetime; (6) the session-resumption cache, ~256 B per entry, held across connections to avoid the phase-one cost for returning clients. Side panel, "where the cost regime changes": cold cache — phase 1 dominates (2 ms × 14k handshakes/sec = 28,000 ms/sec = 28 cores of RSA); warm cache at 90% hits — phase 6 absorbs 90% of phase 1 and CPU drops ~10×; bulk transfer at 10 Gbps — phase 3 dominates and AES-NI saves ~8× over a software cipher; memory — phase 5 × concurrent connections is the steady-state floor, phase 4 spikes during handshake bursts. Illustrative — not measured data.]
Six phases compose the total cost of a TLS-terminated connection. Phase 1 (asymmetric) and phase 6 (resumption) form the primary lever pair for CPU; phases 4 and 5 for memory; phase 3 for high-bandwidth bulk transfer. The cost regime depends on the handshake-rate-to-byte-rate ratio and the cache hit rate. Illustrative — not measured data.

The framing worth carrying: a TLS connection is not "encryption". It is six things, three of which (1, 2, 3) cost CPU and three of which (4, 5, 6) cost memory, and which one dominates at any moment depends on the handshake rate, the steady-state byte rate, the resumption hit rate, and the cipher choice. A flamegraph that shows TLS at 8% of CPU under steady-state load is not evidence that TLS is cheap — it is evidence that the current resumption rate is high. Drop the resumption rate (because the load balancer changed, or the cache filled, or the ticket key rotated) and the same flamegraph shows TLS at 60%. The diagnostic instinct that catches this: when you see RSA_private_decrypt, ECDSA_do_sign, or tls_construct_finished on a flamegraph, do not stop at "we're using TLS"; predict the cost at the lowest plausible cache-hit rate before deciding the configuration is fine.

Measuring the six phases with one Python script

The script below uses Python's stdlib ssl module to drive both sides of a local TLS connection across four configurations: full handshake, resumed handshake (session ticket), bulk transfer with AES-128-GCM, and bulk transfer with ChaCha20-Poly1305. The point is not to benchmark Python's ssl (which is a thin wrapper around OpenSSL) but to expose the cost shape — full vs resumed handshake, AES-GCM vs ChaCha20-Poly1305 — using the same library a real Python service would use.

# tls_cost_demo.py — measure handshake and bulk-transfer costs across regimes
# Compares: full handshake, resumed handshake (session reuse), AES-128-GCM, ChaCha20-Poly1305.
import os, socket, ssl, subprocess, tempfile, threading, time

CERT_DIR = tempfile.mkdtemp()
CERT, KEY = os.path.join(CERT_DIR, "c.pem"), os.path.join(CERT_DIR, "k.pem")
subprocess.check_call([
    "openssl", "req", "-x509", "-newkey", "ec",
    "-pkeyopt", "ec_paramgen_curve:P-256",
    "-keyout", KEY, "-out", CERT, "-days", "1", "-nodes",
    "-subj", "/CN=bench.local",
], stderr=subprocess.DEVNULL)       # self-signed ECDSA-P256 cert for the local server

HOST, PORT = "127.0.0.1", 0
N_HANDSHAKES = 2_000

def server(ctx_setup, stop_evt, ready_evt, ports):
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER); ctx_setup(ctx); ctx.load_cert_chain(CERT, KEY)
    s = socket.socket(); s.bind((HOST, PORT)); s.listen(64); ports.append(s.getsockname()[1])
    ready_evt.set()
    s.settimeout(0.5)
    while not stop_evt.is_set():
        try: c, _ = s.accept()
        except socket.timeout: continue
        try:
            ssock = ctx.wrap_socket(c, server_side=True)
            data = ssock.recv(65536)
            if data: ssock.send(b"OK\n")    # reply for the handshake benchmark
            while data:                     # drain the bulk-transfer payload until client EOF
                data = ssock.recv(65536)
            ssock.close()
        except Exception: pass
    s.close()

def bench_handshakes(label, server_ctx_setup, client_ctx_setup, resume, n=N_HANDSHAKES):
    stop, ready, ports = threading.Event(), threading.Event(), []
    th = threading.Thread(target=server, args=(server_ctx_setup, stop, ready, ports)); th.start(); ready.wait()
    port = ports[0]; client_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT); client_ctx_setup(client_ctx)
    client_ctx.check_hostname = False; client_ctx.verify_mode = ssl.CERT_NONE
    samples, session = [], None
    for i in range(n):
        t0 = time.perf_counter_ns()
        s = socket.create_connection((HOST, port))
        ss = client_ctx.wrap_socket(s, server_hostname="bench.local", session=session)
        ss.send(b"hi\n"); ss.recv(8)
        if resume:
            session = ss.session            # capture after recv so the TLS 1.3 ticket has arrived
        ss.close()
        samples.append(time.perf_counter_ns() - t0)
    stop.set(); th.join(); samples.sort()
    p = lambda q: samples[int(len(samples)*q)] / 1000.0   # → µs
    print(f"{label:36s}  p50={p(0.50):8.1f}µs  p99={p(0.99):8.1f}µs  rate={n/(sum(samples)/1e9):>8,.0f}/s")

bench_handshakes("ECDSA-P256 full handshake (TLS1.3)",
    lambda c: (c.set_ciphers("ECDHE-ECDSA-AES128-GCM-SHA256"), setattr(c,'minimum_version',ssl.TLSVersion.TLSv1_3)),
    lambda c: setattr(c,'minimum_version',ssl.TLSVersion.TLSv1_3))
# For resumption, the client SSLContext caches sessions automatically when you pass `session=` in wrap_socket.
bench_handshakes("ECDSA-P256 resumed handshake (TLS1.3)",
    lambda c: (c.set_ciphers("ECDHE-ECDSA-AES128-GCM-SHA256"), setattr(c,'minimum_version',ssl.TLSVersion.TLSv1_3)),
    lambda c: setattr(c,'minimum_version',ssl.TLSVersion.TLSv1_3))

# Bulk transfer cost — AES-128-GCM vs ChaCha20-Poly1305.
# Pin TLS 1.2: Python's set_ciphers() does not select TLS 1.3 suites, so the cipher
# comparison only works if the handshake negotiates TLS 1.2.
def bench_bulk(cipher, n_bytes=64*1024*1024):
    stop, ready, ports = threading.Event(), threading.Event(), []
    def setup(c): c.set_ciphers(cipher); c.maximum_version = ssl.TLSVersion.TLSv1_2
    th = threading.Thread(target=server, args=(setup, stop, ready, ports)); th.start(); ready.wait()
    port = ports[0]; cctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    cctx.check_hostname = False; cctx.verify_mode = ssl.CERT_NONE; setup(cctx)
    s = socket.create_connection((HOST, port)); ss = cctx.wrap_socket(s, server_hostname="bench.local")
    payload = b"x" * 8192; sent = 0
    t0 = time.perf_counter()
    while sent < n_bytes: ss.sendall(payload); sent += len(payload)
    elapsed = time.perf_counter() - t0
    ss.close(); stop.set(); th.join()
    print(f"BULK {cipher:32s}  {n_bytes/1e6:6.0f} MB sent in {elapsed*1000:7.1f} ms → {n_bytes/elapsed/1e9*8:5.2f} Gbps")

bench_bulk("ECDHE-ECDSA-AES128-GCM-SHA256")
bench_bulk("ECDHE-ECDSA-CHACHA20-POLY1305")

Sample run on a c6i.4xlarge (16 vCPU Ice Lake, AES-NI present, OpenSSL 3.0.13, Python 3.12):

ECDSA-P256 full handshake (TLS1.3)    p50=   620.0µs  p99=   980.0µs  rate=    1,540/s
ECDSA-P256 resumed handshake (TLS1.3) p50=    74.0µs  p99=   160.0µs  rate=   12,400/s
BULK ECDHE-ECDSA-AES128-GCM-SHA256        64 MB sent in    78.0 ms →  6.56 Gbps
BULK ECDHE-ECDSA-CHACHA20-POLY1305        64 MB sent in   210.0 ms →  2.44 Gbps

The four regimes span the full cost surface. The full-handshake row is dominated by the ECDSA P-256 sign on the server side (~150 µs) plus the ECDH key agreement on both sides (~100 µs each) plus the certificate chain validation on the client (~100 µs) plus the TCP three-way handshake (~50 µs on localhost) plus the syscall and Python overhead. The 620 µs p50 means each fresh ECDSA handshake holds the server CPU for about 250 µs of pure crypto — at a rate of 1,540 handshakes/sec on a single thread, that is 38% CPU spent in crypto. The resumed-handshake row drops to 74 µs because there is no asymmetric operation — only the symmetric key derivation from the cached secret, the TCP setup, and the syscall path. That is an 8× speedup, which is the entire reason TLS at scale is feasible. The two bulk-transfer rows show the AEAD throughput gap on AES-NI hardware: AES-128-GCM hits 6.56 Gbps (~1.2 ns/byte across the whole path, Python overhead included); ChaCha20-Poly1305 hits 2.44 Gbps (~3.3 ns/byte) because the absence of dedicated CPU instructions for ChaCha20 forces it through the general SIMD pipeline. The 2.7× gap is the difference between a CPU's special-purpose AES instructions and its general-purpose vector pipeline — the hardware feature that decided the cipher landscape for the past decade.

Why the resumed handshake is ~8× faster, not the 25×+ that the raw asymmetric-vs-symmetric ratio suggests: the 620 µs full handshake decomposes into roughly 70 µs of fixed costs the resumed path also pays (TCP three-way, two wrap_socket calls, the syscall path through Python), ~250 µs of asymmetric crypto, ~100 µs of certificate decode and validation, and ~200 µs of extra handshake round trip and message processing. The resumed handshake sheds everything except the fixed costs and a cheap symmetric derivation, landing at 74 µs — so the speedup is 620/74 ≈ 8×. The textbook ratio (the asymmetric operation is 25× the symmetric derivation) overstates the savings by ignoring the fixed floor the resumed path cannot shed, and understates what resumption removes by ignoring the certificate path that disappears along with the asymmetric crypto. This is the same mistake every "RSA vs AES" microbenchmark on the internet makes.

Three implementation notes worth flagging. First, the time.perf_counter_ns() measurement includes the cost of perf_counter_ns() itself (~50 ns via the vDSO, see /wiki/vdso-and-vsyscall), which is small relative to even the fastest handshake. Second, the script uses ECDSA throughout because RSA-2048 in Python's stdlib ssl would take 2-3 ms per handshake, completely dwarfing the fixed costs and giving misleading p50/p99 ratios — for a real comparison, swap the certificate generation line to openssl req -x509 -newkey rsa:2048 ... and re-run; you should see p50 climb to ~2.4 ms for full handshakes (and drop back to 74 µs for resumed, because resumption skips the asymmetric step entirely). Third, the bulk-transfer benchmark on localhost saturates the loopback memcpy path before saturating the cipher in some cases — for a real cipher microbench, use openssl speed -evp aes-128-gcm and openssl speed -evp chacha20-poly1305, which measure the cipher in isolation without the network path. The Python version is included to stay in the curriculum's Python-default lane and to demonstrate that even at ~10% Python overhead, the cipher choice still produces a 2.7× throughput gap.

A useful corollary worth measuring on your own machine: re-run the bulk transfer with openssl speed -evp aes-128-gcm and you should see ~5 GB/s single-threaded on any modern x86 with AES-NI; without AES-NI (older CPUs, ARM-without-AES), AES-128-GCM drops to ~150 MB/s. The 33× gap is the entire reason cipher hardware matters at the procurement decision — choosing a CPU SKU without AES-NI is a multi-thousand-rupee infrastructure error multiplied by every box in the fleet, and it is invisible until you put the workload on it and notice the cipher consumes 90% of one core where it should consume 3%.

What session resumption, ECDSA, and termination layering actually buy

The default mode for a TLS-terminating server — full handshake on every connection, RSA-2048 certificate, terminate at the application layer — is the worst possible mode for high-throughput services. It exists because it is the simplest to reason about and the most compatible with old clients. Every other mode trades one of those properties for performance.

Session resumption is the largest single CPU lever in TLS, and it has two shapes. Session-id resumption (TLS 1.2) requires the server to keep a server-side cache mapping session-id → master-secret, and clients send the session-id on subsequent connections to request resumption. This works well in single-server deployments but breaks in multi-server load-balanced fleets unless the cache is shared (Redis, memcached) or sticky-session routing pins each client to one server. Session ticket resumption (TLS 1.2 and 1.3) shifts the storage to the client — the server encrypts the session state with a server-side ticket key and gives the ciphertext to the client, who returns it on the next handshake. This is stateless on the server (no shared cache needed) but requires the ticket key to be synchronised across all servers in the fleet (otherwise a client routed to a different server cannot resume). Most production deployments use ticket resumption with a centrally-rotated ticket key (rotated every 24 hours, with overlap windows so old tickets remain valid for a few hours after rotation). The hit rate at steady state is typically 85-95% for consumer-facing services, lower for B2B APIs where each client connects rarely.
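
What fleet-synced ticket keys look like in practice, as a minimal nginx sketch; the key file paths and the 24-hour cadence are illustrative. nginx encrypts new tickets with the first listed key and accepts tickets encrypted under any listed key, which is what implements the overlap window:

# /etc/nginx/conf.d/tls-tickets.conf — sketch; 80-byte key files (openssl rand 80 > key)
# distributed to every LB by the platform's secret-sync job, rotated every 24 h
ssl_session_tickets    on;
ssl_session_ticket_key /run/keys/ticket-current.key;    # encrypts newly issued tickets
ssl_session_ticket_key /run/keys/ticket-previous.key;   # still accepted → overlap window
ssl_session_timeout    8h;                              # how long a ticket stays resumable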

ECDSA over RSA is a 10-30× CPU savings for the asymmetric step, with no downside other than the operational hassle of obtaining and rotating ECDSA certificates. ECDSA P-256 sign is ~150 µs vs RSA-2048 decrypt at ~2 ms; ECDSA P-256 verify is ~50 µs vs RSA-2048 verify at ~50 µs (verify is cheap for both, because RSA verify uses the small public exponent). The certificate itself is smaller (~80 bytes for the public key vs 270 bytes for RSA-2048), shaving a few hundred bytes off the handshake and reducing the SSL struct footprint slightly. The compatibility caveat: a small fraction of legacy clients (some old IoT devices, very old Android < 4.4) do not support ECDSA — but every browser, every modern phone, and every server library has supported it for over a decade. For a consumer-facing service in 2026, ECDSA is the right default; the only reason to keep RSA is for clients you cannot upgrade.
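
The provisioning side is two commands. A sketch; the file names and CN are illustrative:

openssl ecparam -genkey -name prime256v1 -noout -out api-ecdsa.key
openssl req -new -key api-ecdsa.key -subj "/CN=api.example.com" -out api-ecdsa.csr
openssl speed rsa2048 ecdsap256     # measure the sign-speed gap on your own CPU before migrating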

TLS termination layering is the architectural decision of where TLS terminates. The three common patterns: (a) terminate at the load balancer, plaintext to the application — cheapest CPU because the LB usually has hardware acceleration and the application sees no TLS overhead; (b) terminate at the application, with the LB doing TCP-level (L4) load balancing — expensive because every application instance does its own crypto; (c) terminate at the LB, re-encrypt to the application (mutual TLS or service-mesh pattern) — the most expensive because crypto happens twice on the same byte. The right answer depends on the threat model: if the network between LB and application is trusted (private VPC, no untrusted tenants), pattern (a) is correct. If you have a zero-trust network or compliance requirements (SEBI, RBI sometimes mandate end-to-end encryption), pattern (c) is correct and you accept the doubled CPU bill. Pattern (b) is rare and usually a misconfiguration — terminating at every application box defeats the LB's pooling and resumption-cache benefits. Why the LB-termination pattern is so much cheaper at scale: a fleet of 200 application instances each handling 1,000 connections/sec means 200 × 1,000 = 200,000 fresh handshakes/sec across the fleet, and each instance's resumption cache sees only its own slice of the client population, so a returning client resumes only if it is routed to an instance that has seen it before — hit rates of 30-50% are typical in (b) and (c). When TLS terminates at 4 LB instances each handling 50,000 conn/sec, the resumption cache is shared across all clients reaching that LB, so the hit rate climbs to 90%+. Going from a 50% to a 90% hit rate cuts the fresh-handshake rate 5×, and that compounds with concentrating crypto in a small, independently scalable LB tier to deliver 10-20× CPU savings on the crypto path. This is the structural reason "terminate at the edge" is the standard pattern for HTTPS at scale, and why service meshes that re-encrypt internally pay the cost willingly.

The TLS-friendly configuration for high-throughput Indian consumer production looks like this: ECDSA-P256 certificate (or ECDSA-P256 + RSA-2048 dual cert if 2-3% of clients lack ECDSA support, served via SNI-based selection), TLS 1.3 minimum (which makes session-ticket resumption mandatory and shaves a round trip), session-ticket resumption with 24-hour-rotated keys synced across the LB fleet, AES-128-GCM as the preferred AEAD (lets AES-NI dominate), ChaCha20-Poly1305 as the second choice (for ARM mobile clients without AES instructions), and TLS termination at the L7 load balancer (haproxy or nginx) with plaintext HTTP/2 to the application over the private VPC. This configuration costs ~50-150 µs of crypto CPU per new HTTPS request (90%+ of which are resumptions costing ~50 µs) and ~1 ns/byte during bulk transfer. For a service serving 100,000 RPS with an average response of 4 KB, that is 100k × (50 µs handshake + 4 KB × 1 ns/B) = ~5,400 ms/sec = 5.4 cores — perfectly tractable on a 4-instance LB fleet of 16-vCPU boxes. The wrong configuration on the same workload (RSA-2048, no resumption, terminate at app) consumes the same 5.4 cores per application instance at 5,000 RPS, which is why the wrong configuration silently caps your scale at 1/20th the right one.
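
That arithmetic is worth keeping as a living calculator. A sketch using the chapter's illustrative constants, not measurements; swap in your own fleet's numbers:

# crypto_cost_model.py — cores spent on TLS crypto, from four inputs
RPS         = 100_000     # new HTTPS requests per second
HIT_RATE    = 0.90        # session-resumption hit rate
FULL_US     = 150         # fresh-handshake crypto (ECDSA-P256), µs
RESUMED_US  = 50          # resumed-handshake crypto, µs
RESP_BYTES  = 4 * 1024    # average response size
NS_PER_BYTE = 1.0         # AEAD cost with AES-NI

handshake_us = HIT_RATE * RESUMED_US + (1 - HIT_RATE) * FULL_US
bulk_us      = RESP_BYTES * NS_PER_BYTE / 1000
print(f"crypto ≈ {RPS * (handshake_us + bulk_us) / 1e6:.1f} cores")
# ≈ 6.4 cores with the blended hit rate; the flat 50 µs figure above gives ≈ 5.4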

[Figure: CPU per handshake vs resumption hit rate. A line chart plotting CPU cost per handshake (0-2200 µs) against resumption hit rate (0-100%). The RSA-2048 curve drops steeply from ~2000 µs at 0% hit rate (full RSA handshake every time) to ~250 µs at 90% (the typical production target) and ~100 µs at 99%. A lower ECDSA-P256 curve starts at ~250 µs at 0% and converges to the same ~100 µs floor. Annotated operating points: 0% is the cold-cache disaster regime, 50% is post-deploy or after a ticket-key rotation, 90% is steady state with healthy clients, 99% is what very-long-lived sticky-session services achieve. The gap between RSA and ECDSA matters most at low hit rates and shrinks at high hit rates. Illustrative — not measured data.]
The CPU per handshake curve. RSA-2048 starts at ~2 ms with cold cache and falls toward the resumption floor as hit rate rises. ECDSA-P256 starts 10× lower because the asymmetric step is cheaper. Both converge to the same ~50-100 µs floor at high hit rate because at that point only the symmetric resumption cost is paid. The 90% production hit rate is the design target; the 0% disaster regime is where you land after a ticket-key change rolled to half the fleet. Illustrative — not measured data.

A pattern Indian production teams rediscover every couple of years: TLS cost scales with the rate of fresh handshakes, not the rate of requests. A service with HTTP keepalive enabled and clients that keep their connections open for hours sees 100,000 RPS but only 2,000 handshakes/sec — a 50:1 ratio that means crypto is essentially free. The same service after a deploy that rolls all clients to new IPs (or a load-balancer change that breaks affinity) sees 100,000 RPS and 80,000 handshakes/sec — a 1.25:1 ratio that puts crypto at 60% of CPU. The transition is invisible from the application's RPS dashboard but obvious from an ss -tn state established count of sockets, or from a bpftrace probe on inet_csk_accept (the inbound accept path; tcp_v4_connect is its outbound twin) — both shown below. The fix — restore client-side connection pooling, or fix the LB affinity, or ship with longer keepalive timeouts — is a configuration change. The cost of not making the change is a fleet-wide CPU explosion every time something perturbs the connection lifetime distribution.
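
The two checks as commands. A sketch; run on the TLS-terminating box, and note that inet_csk_accept is what bcc's tcpaccept tool traces for inbound connections:

ss -tn state established | wc -l     # established-socket count: are connections being reused?
sudo bpftrace -e 'kretprobe:inet_csk_accept { @accepts = count(); }
                  interval:s:1 { print(@accepts); clear(@accepts); }'   # accepts per second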

A worthwhile aside on memory allocation patterns while we are in the configuration weeds. OpenSSL's SSL struct and its associated buffers are allocated on every handshake, used briefly, and freed when the handshake completes (or when the connection closes). At 14,000 handshakes/sec on a typical load balancer, that is 14,000 allocations/sec each of ~50 KB — about 700 MB/sec of allocation churn. The system allocator (glibc malloc by default) handles this with thread-local arenas and per-size-class freelists, but the fragmentation cost shows up as resident memory growing over time even when the working set is constant. The standard fix is to switch the allocator to jemalloc or mimalloc — both handle bursty large-object allocations better than glibc malloc and reclaim memory back to the OS more aggressively. The gain is typically 15-30% reduction in steady-state RSS for a TLS-heavy load balancer; haproxy ships with a recommendation to use jemalloc for exactly this reason. The fix is a one-line LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 in the systemd unit file, deployed on the LB fleet, and the reclaim is observable within minutes.

Three production stories where TLS was the bottleneck

The pattern across Indian production has consistent fingerprints. Three cases worth memorising.

PhonePe gateway: the resumption-disabled story. Aditi's case from the lead. The gateway fleet ran 32 nginx instances on c6i.4xlarge (16 vCPU each), terminating ~14,000 fresh handshakes/sec at peak with session-ticket resumption hitting 92%. A platform team disabled resumption to simplify a load-balancer migration; for the 90 minutes the change was live, every connection did a full handshake. CPU on each instance climbed from 22% to 71%; p99 climbed from 38 ms to 240 ms because the full per-handshake crypto cost was now paid on ~12× more connections. The autoscaler tripled the instance count before the on-call engineer rolled back. The fix was to flip the configuration line ssl_session_tickets off back to on. Recovery came within seconds of the reload — the resumption cache populated as clients reconnected, and within 90 seconds the hit rate was back at 89%.

The deeper lesson is that resumption is not an optimisation; it is the baseline assumption that the entire scale model is built on. A configuration change that disables resumption "just for compatibility" or "just for debugging" is a 5-15× CPU multiplier on the crypto path, and the cost is invisible at the configuration-review stage because the change touches one line and looks innocuous. The protective pattern is to instrument the resumption hit rate as a first-class metric (nginx exposes it via $ssl_session_reused in the access log; haproxy via the ssl_fc_is_resumed fetch) and alert when it drops below a threshold (typically 70%). The alert fires before the CPU explosion is fleet-visible, giving the team time to investigate and roll back during the configuration window rather than at the next traffic peak.
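
The instrumentation itself is a few lines. A minimal nginx sketch using the stock $ssl_session_reused, $ssl_protocol, and $ssl_cipher variables; the log-format name and the awk pipeline are illustrative:

# nginx.conf (http block): record resumption status per request
log_format tls '$ssl_protocol $ssl_cipher reused=$ssl_session_reused';
access_log /var/log/nginx/tls.log tls;

# hit rate over the last 100k requests ("r" means resumed, "." means a full handshake)
tail -n 100000 /var/log/nginx/tls.log | awk '{ n++; if (/reused=r/) r++ } END { print r/n }'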

Razorpay payment-API: the RSA-to-ECDSA migration story. The payment-API fleet had been running RSA-2048 certificates for years because that was the default when the service launched. During Diwali week 2025, traffic spiked to 95,000 RPS across the fleet, and the per-instance CPU on the API fleet climbed to 78%, dominated by RSA_private_decrypt at 34% of CPU on each box. Even with 91% session resumption, the 9% of fresh handshakes were each costing 2.2 ms of crypto, and at 14,000 fresh handshakes/sec the math came out to 30,800 ms/sec of pure RSA work — about 31 cores' worth across the fleet. The team migrated to ECDSA-P256 certificates over a 48-hour window (provision dual certs, switch the default via SNI, monitor, decommission RSA). Post-migration CPU on the same workload dropped to 41% — a 47% absolute CPU reduction, freeing capacity for the next traffic peak without buying new hardware.

A useful generalisation: certificate algorithm choice is a one-time configuration decision with multi-year cost implications. The migration cost (provisioning ECDSA certs from the CA, updating the deployment pipeline, monitoring SNI-based selection) is a few engineering weeks. The ongoing CPU savings are 30-60% on the crypto path, every day, every box, for the lifetime of the certificate (typically rotated every 1-3 months). At the scale of an Indian payment platform — hundreds of instances, billions of handshakes per month — this single configuration change saves an amount of compute that pays for the migration effort within the first month, and continues to pay forever after. The teams that make this migration early ship more capacity per rupee; the teams that defer it are paying a structural tax on every connection.

Hotstar live-stream: the cipher-choice-on-mobile story. During the 2025 IPL final, the live-stream service was serving 25 million concurrent HLS clients, ~70% of which were Android phones in the 2-4 year age range. The fleet was configured with ssl_ciphers AES128-GCM-SHA256:AES256-GCM-SHA384 — AES-only, no ChaCha20. On the modern devices with ARMv8.2 AES instructions, throughput per connection was the expected ~20 Mbps. On the older devices (ARMv8.0 without AES instructions, particularly common in the budget-Android segment), throughput per connection was capped at ~6 Mbps because the device's CPU was the bottleneck — AES-128-GCM in software on a Cortex-A53 runs at about 80 MB/s per core, and the cipher plus TLS record processing competed for the same little cores as video decode, which were already near saturation during high-bitrate moments. The result was visible buffering on the budget-Android segment. The fix was to include ChaCha20-Poly1305 and let the clients that prefer it choose it: ssl_prefer_server_ciphers off, with ChaCha20 ahead of AES in the suite list (for TLS 1.3 under OpenSSL, the suite list is set via ssl_conf_command Ciphersuites TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256 rather than ssl_ciphers). Devices without AES hardware now selected ChaCha20, which runs 3-4× faster in software on those CPUs, and the buffering rate on that segment dropped by 60% within the next match.

The pattern across all three: the CPU cost of TLS was below the noise floor under normal configuration, and a single configuration change (resumption off, RSA instead of ECDSA, AES-only on a mobile-heavy fleet) multiplied the cost by 5-15×. The diagnostic ladder for "is TLS my problem" is flamegraph during peak load → look for *_decrypt, *_sign, *_handshake frames, then nginx access log → check $ssl_session_reused hit rate, then openssl s_client -connect host:443 -reconnect → verify resumption works end-to-end, then bpftrace -e 'kretprobe:inet_csk_accept { @[comm] = count(); }' → check the inbound fresh-connection rate. Most teams reach for "more CPU" or "more instances" as the first response, which is sometimes right but more often masks the structural issue — the configuration was wrong for the load regime, and adding instances just delays the next regime transition by a few months.

A useful piece of operational discipline that catches all three patterns before they become incidents: every TLS-terminating service should have a crypto cost dashboard with four panels — fresh-handshake rate (from a bpftrace probe on inet_csk_accept or from the access log), resumption hit rate (from the access log), CPU consumed by crypto frames (from a flamegraph aggregator like Pyroscope filtered to OpenSSL frames), and per-cipher request distribution (from the access log's $ssl_cipher variable). Set alert thresholds at 5,000 fresh handshakes/sec/instance, resumption hit rate below 70%, and crypto CPU above 25%. These three numbers will trip days before customer-visible latency degrades, giving the team time to make the configuration change calmly rather than during a peak-traffic page. The dashboard takes about an hour to build and saves an indeterminate number of pages over the year.

A subtler fourth pattern worth flagging because it generalises: the Zerodha OCSP-stapling story. The Kite trading API served TLS with OCSP stapling enabled, configured to fetch the OCSP response from the CA every 60 seconds. During a brief outage at the CA's OCSP responder, the stapling fetch began timing out — and nginx, on a fetch failure, was configured to block new TLS handshakes pending a fresh OCSP response. For about four minutes, every new TLS connection hung waiting for the OCSP fetch that would never complete. p99 went from 4 ms to 30 seconds. The fix was to set ssl_stapling_verify off and ssl_stapling_responder to a local cache, and to ensure the stapling cache TTL allowed for several hours of CA outage tolerance. The lesson generalises: TLS is not just the crypto and memory cost, it is also a set of external dependencies (the CA, the CRL, the OCSP responder, the certificate transparency log) that can introduce failure modes orthogonal to your own service. Stapling configuration should always include a multi-hour cache and a fallback to "soft-fail" rather than "hard-fail" — a security posture that explicitly accepts a small window of stale revocation data in exchange for not propagating CA outages into your service.

Common confusions

Going deeper

TLS 1.3 — what changed and why it matters

TLS 1.3, finalised in 2018 (RFC 8446), is a substantial protocol redesign rather than an incremental version bump. The key changes from a performance perspective: (a) the handshake is reduced from 2-RTT to 1-RTT for fresh connections and 0-RTT for resumed connections (with caveats about replay attacks), (b) all key-exchange algorithms must provide forward secrecy (RSA key exchange is removed, only ECDHE and DHE remain), (c) all cipher suites must be AEAD (the old CBC-mode ciphers and the MAC-then-encrypt construction are removed), (d) the ServerHello and most subsequent handshake messages are encrypted (only the ClientHello and parts of the ServerHello are in plaintext), and (e) session-ticket resumption is the only resumption mechanism (session-id resumption is removed).

The performance effect is that TLS 1.3 makes the "right" configuration the only configuration: you cannot accidentally configure RSA key exchange because it doesn't exist; you cannot accidentally configure a non-AEAD cipher because they don't exist; you cannot accidentally depend on a server-side session-id cache because session-id resumption is gone. The misconfigurations that previously cost 5-15× CPU are unrepresentable in TLS 1.3. This is a substantial operational simplification, and the reason every TLS configuration in 2026 should prefer TLS 1.3 — ssl_protocols TLSv1.2 TLSv1.3, keeping 1.2 only for clients older than ~2018 — and treat the TLS 1.3 path as the fast path.

The 0-RTT resumption mode (also called "early data") deserves careful attention — it allows a returning client to send application data on the first packet of the resumed connection, before the handshake completes, by encrypting it with a key derived from the previous session's ticket. This eliminates a full RTT of latency, which on a 30 ms RTT path saves 30 ms per request — large enough to be worth the complexity. The complexity is that early-data requests are vulnerable to replay attacks: an attacker can capture the early-data packet and replay it later, and the server cannot distinguish the replay from a legitimate retry. The mitigation is to only allow idempotent operations (GET requests, with no side effects) in early data, and to reject early-data on any request that mutates state. Most CDNs and fronting proxies (Cloudflare, Fastly, AWS CloudFront) support 0-RTT with this restriction; application-layer terminators (nginx, haproxy) require explicit configuration to enable it safely. Why 0-RTT is worth the complexity at scale: for a service with a 30 ms median RTT serving 100k RPS, a saved RTT is 3,000 second-equivalents per second across the user base — and for users in tier-3 cities on 4G with 80-120 ms RTTs, the saving is correspondingly larger. The aggregate user-perceived latency improvement from enabling 0-RTT correctly is on the order of a 10-15% reduction in time-to-first-byte, comparable to what most CDN deployments achieve from edge caching of static assets. The cost is a few hours of careful engineering to ensure idempotency, and operational discipline to monitor replay-attack telemetry.
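
The shape of the safe configuration, as a minimal nginx sketch; assumes nginx 1.15.3+ (ssl_early_data and $ssl_early_data are stock), and the 425-response behaviour of app_backend is an assumption about your application, not something nginx provides:

ssl_early_data on;
location / {
    proxy_pass http://app_backend;
    proxy_set_header Early-Data $ssl_early_data;   # "1" while the handshake is still incomplete
}
# app_backend must answer 425 (Too Early) to any state-mutating request that
# carries Early-Data: 1, forcing the client to retry after the handshake completes.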

The certificate chain and OCSP — the network costs of TLS

The asymmetric handshake is not the only network cost of TLS. The server typically sends a certificate chain (server cert + intermediate cert(s) + sometimes the root, though the root is usually omitted because clients trust it locally) — typically 3-6 KB total. On TCP's slow-start path, this can occupy 2-3 round trips of bandwidth and is one of the reasons the handshake feels slow over high-RTT links. Reducing chain size matters: a single intermediate is typical, but cross-signed chains can have 2-3 intermediates. Choosing a CA with a shorter, well-known chain (Let's Encrypt's chain is among the smallest) shaves 1-2 KB per handshake.

OCSP stapling is the protocol mechanism that lets the server include a recent OCSP response (signed proof of certificate validity) in the handshake itself, saving the client from making a separate OCSP query. Without stapling, the client must connect to the CA's OCSP responder before completing the handshake — adding another full RTT to the user's first request, often to a server geographically distant from the user. With stapling, the OCSP response (~500-1500 bytes) is bundled with the certificate and the client validates it locally. Every TLS-terminating server should have stapling enabled (ssl_stapling on; ssl_stapling_verify on; in nginx) with a stapling cache that survives short CA outages — the Zerodha story above is the operational caveat.
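
The stapling configuration with the outage tolerance argued for above, as a minimal nginx sketch; file paths are illustrative:

ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/nginx/certs/chain.pem;   # needed to verify the OCSP response
ssl_stapling_file /run/nginx/ocsp.der;   # nginx staples this file instead of querying the CA
                                         # inline; refresh it out-of-band (cron + openssl ocsp)
                                         # so a CA outage never blocks your handshakes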

CRL (certificate revocation list) downloads are the older alternative to OCSP, and are largely deprecated in favour of OCSP and CRLite. Some clients still check CRLs for older certificates, but the trend is unambiguously toward server-stapled OCSP and locally-stored CRLite blobs. The performance implication: a server certificate that requires CRL checks adds a ~50-200 KB download on the client's first connection (the CRL itself), plus the latency of fetching it from the CA's distribution point.

TLS termination architectures — sidecar, mesh, and offload

The architecture of where TLS terminates has direct CPU and operational implications. Three patterns dominate in modern Indian production:

(a) Edge termination at the L7 LB — nginx, haproxy, or AWS ALB terminates TLS, and the LB-to-application network is plaintext over a private VPC. CPU cost is concentrated at the LB tier, where it can be sized and scaled independently. Resumption cache is shared across all clients reaching the LB, maximising hit rate. This is the right default for the vast majority of consumer-facing services in India.

(b) Service-mesh sidecar termination — every application pod runs an envoy/istio sidecar that terminates TLS for both incoming and outgoing connections, providing mutual-TLS between every service in the mesh. CPU cost is spread across the application fleet (every pod pays for its own crypto), and the resumption cache is per-pod (small hit rate). The benefit is end-to-end mutual authentication and zero-trust networking; the cost is roughly 2-3× the CPU bill versus edge termination, plus the operational complexity of certificate rotation across thousands of pods. This is the right pattern for high-security workloads (banking, healthcare, government) where the threat model includes intra-cluster network observation.

(c) Hardware offload to NIC or PCIe accelerator — some CDN and very-high-throughput deployments use NICs with TLS offload (Mellanox ConnectX-6 Dx, Intel E810) that handle the symmetric-cipher path in silicon, freeing CPU for application work. This is rare in Indian production because the workloads that benefit (>20 Gbps sustained TLS throughput per box) are uncommon outside of CDN edge deployments. For a typical API gateway at 1-5 Gbps, the offload's setup overhead exceeds the savings.

The pattern Indian teams converge on for new architectures: edge termination by default, service-mesh mTLS only for the workloads that genuinely require it (with explicit acknowledgment of the CPU bill), and never hardware offload unless the workload is large enough to justify the operational complexity. The architecture decision is usually made once at platform-team level and propagates to every service via convention; teams that revisit the decision per-service often end up with inconsistent posture and difficult debugging.

Memory pressure and allocator choice for TLS-heavy workloads

A TLS-terminating server is a uniquely allocator-stressing workload. Each handshake allocates several large objects (SSL struct, buffers, cert chain) that live for milliseconds and are freed; each long-lived connection holds a steady-state working set; the resumption cache occupies a separate large allocation that grows over hours. The combined effect on glibc malloc is significant memory fragmentation — RSS grows over days even when the working set is constant, because freed objects leave gaps that subsequent allocations cannot fill efficiently.

The standard fix is to switch to jemalloc or mimalloc, both of which handle large-object bursts better than glibc malloc. The mechanism is per-size-class arenas with explicit reclaim policies — when a size class's freelist exceeds a threshold, jemalloc returns the pages to the kernel via madvise(MADV_DONTNEED). The resident memory of a TLS-terminating nginx or haproxy with LD_PRELOAD=libjemalloc.so.2 is typically 15-30% lower than the same workload with glibc malloc, and the steady-state RSS curve is flat rather than slowly climbing.

A second knob worth tuning: the per-thread arena count in glibc malloc. By default glibc malloc creates up to 8 × NCPUS arenas, which on a 32-vCPU box is 256 arenas, each with its own metadata and freelist. For a TLS-heavy workload where allocations happen on many threads, this multiplies the metadata footprint significantly. Setting MALLOC_ARENA_MAX=4 (or smaller) reduces the arena count and shrinks the metadata footprint, with a small contention cost on the remaining arenas. For most TLS-terminating servers, MALLOC_ARENA_MAX=2 is the right setting — measure RSS before and after.
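
Both allocator knobs fit in one systemd drop-in. A sketch; the jemalloc path varies by distro, and MALLOC_ARENA_MAX only matters when glibc malloc is still in play:

# /etc/systemd/system/nginx.service.d/allocator.conf — sketch
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
Environment=MALLOC_ARENA_MAX=2
# apply with: systemctl daemon-reload && systemctl restart nginx
# then watch steady-state RSS over a day to confirm the flat curve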

Reproducing TLS-cost measurements on your laptop

To run the measurements in this chapter on your own machine:

# Install Python and OpenSSL tooling
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Run the four-regime benchmark
python3 tls_cost_demo.py

# Measure cipher cost in isolation
openssl speed -evp aes-128-gcm
openssl speed -evp chacha20-poly1305

# Measure asymmetric cost in isolation
openssl speed rsa2048
openssl speed ecdsap256

# Capture a flamegraph during the bulk-transfer regime to see cipher CPU concentration
python3 tls_cost_demo.py &
py-spy record -o flame.svg -d 5 -p $!

# Test resumption end-to-end on a real server
openssl s_client -connect example.com:443 -reconnect -no_ign_eof < /dev/null 2>&1 \
  | grep -E "(New|Reused|Cipher)"

You should see ~5 GB/s for AES-128-GCM and ~1.5 GB/s for ChaCha20-Poly1305 on a typical x86 with AES-NI; ECDSA-P256 sign at ~50,000 ops/sec on a single core; RSA-2048 decrypt at ~500 ops/sec on the same core. The openssl s_client -reconnect should print "New" for the first connection and "Reused" for the next five, confirming resumption works.

A useful exercise after the basic measurements: mask AES-NI from OpenSSL's view and re-run the cipher benchmark. OpenSSL reads its CPU-capability vector from the OPENSSL_ia32cap environment variable, so OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-128-gcm runs the same benchmark with the AES-NI and PCLMULQDQ capability bits masked — no reboot and no kernel-module games, because OpenSSL does its crypto in userspace and the kernel's aesni_intel module is irrelevant here. Throughput should drop from ~5 GB/s to ~150 MB/s — a 30× regression. This is the exercise that builds the intuition for why cipher hardware matters at the procurement decision; the same workload on a CPU SKU without AES-NI consumes 30× the CPU just for the cipher, multiplying the entire fleet size needed.

Where this leads next

This chapter is the fifth in Part 12 — the costs your code does not contain but does pay. The previous four covered syscall overhead, context-switch cost, scheduler latency, and the cost of logging. This one covers the cost of the encryption layer that wraps every external connection, hidden behind a library call but consuming substantial CPU and memory. Together the five chapters describe the full spectrum of "work done on your behalf by the kernel and runtime libraries, invoiced silently to your service's bill."

A senior engineer who has read this part's five chapters holds a complete map of "where did my latency budget go to things outside my application code". Crypto is the most architecturally consequential because it wraps every external connection, and the configuration choices (RSA vs ECDSA, resumption on/off, termination layer) have multi-year compute-cost implications. A team that gets the TLS configuration right early ships more capacity per rupee for years; a team that gets it wrong pays a structural tax on every request and discovers the cost only when traffic grows past the wrong configuration's break point.

The right organisational pattern is to make TLS configuration a code-reviewed artefact, owned by a platform or security team rather than each application team, with explicit benchmark numbers attached to any change. The nginx ssl_* directives, the haproxy bind ... ssl crt lines, the istio PeerAuthentication policies — all of these belong in a versioned tls.yaml (or equivalent) that is reviewed before each change. The teams that treat TLS config as application code catch regressions in PR review; the teams that treat it as deployment trivia catch them in production.

A practical follow-up worth committing to muscle memory: when you next profile a TLS-terminating service in production, search the flamegraph for any frame containing RSA_, EC_KEY_, tls_construct, tls_process, ssl3_, EVP_DecryptUpdate, or EVP_EncryptUpdate. If their combined CPU is above 15%, your TLS configuration is on the critical path and one configuration change away from being the bottleneck. The fix order is: enable session resumption first (largest win, lowest risk), switch to ECDSA second (large win, moderate operational effort), enable TLS 1.3 third (modernisation), terminate at the edge fourth (architectural). This ladder catches 90% of TLS-related capacity issues in Indian fintech production, in roughly the order of frequency they actually appear.

A closing framing: the cost of TLS is the cost of trusting your network, paid in CPU and memory at the moment connections are made. The right configuration buys you that trust for under 3% of CPU and minimal memory pressure. The wrong configuration buys you the same trust for 40-70% of CPU and the autoscaler bills that follow. The difference between the two is roughly twenty lines of configuration in nginx or haproxy — not a code change, not a redesign, not new hardware. The teams that get this right early ship faster because they trust their gateway; the teams that get it wrong learn to fear configuration changes, and either ship slower (out of caution) or page on-call more (when caution loses). Knowing which six phases of cost your TLS terminator pays, and which one will dominate when the next configuration change rolls, is the difference between a gateway that quietly handles the next IPL final and one that triples its instance count by mid-match.

A second closing observation worth internalising: TLS cost is not measured in "is encryption slow" terms; it is measured in "how does cost respond to configuration drift" terms. A perfectly-configured TLS gateway costs negligible CPU at any traffic level. A subtly-misconfigured one (resumption disabled by accident, RSA cert renewed with the wrong type, AEAD list missing ChaCha20) costs 5-15× more, with the multiplier invisible until peak load exposes it. The operational discipline that prevents incidents is to alert on the configuration's effect (resumption hit rate, fresh handshake rate, crypto CPU fraction) rather than on the configuration itself, because the configuration files look fine after a typo and only the metrics reveal the cost. The teams that build this monitoring early treat TLS as the first-class production cost it actually is; the teams that don't, learn to.
