In short
Redis keeps the entire dataset in RAM, so a crash, a kernel panic, or a careless kill -9 loses everything that has not been written to disk. Persistence is the set of mechanisms that let a restarted Redis reconstruct its keyspace, and there are two: RDB and AOF. RDB (Redis DataBase) takes a periodic point-in-time snapshot of the whole memory image to a binary file (dump.rdb). When the snapshot trigger fires — by default, "5 minutes have passed and at least 100 keys changed", configurable in redis.conf as save 300 100 — the parent process calls fork(), the kernel copies the page tables (not the data — copy-on-write), the child walks the keyspace and writes a tightly packed binary dump, then exits. The parent keeps serving every command at full speed during the snapshot. RDB files are small (often 5–10× smaller than the in-memory representation), restart is fast (one sequential disk read into typed structures), but everything written between snapshots is lost on a crash — typically minutes of data. AOF (Append-Only File) takes the opposite shape: every write command is appended to a log file in the RESP protocol as it executes (*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n). Restart replays the log command-for-command, reconstructing the exact final state. The file is always within one fsync interval of current, so durability is much better — but the file is bigger (every operation, not the final state) and restart is slower (replaying ten million commands takes minutes). AOF gets a periodic BGREWRITEAOF that compacts the log: a child fork walks the current memory and writes the shortest equivalent script (SET counter 1247 instead of 1247 INCRs), then atomically renames the new file over the old. The real durability knob is appendfsync: always (fsync every write, slowest, zero loss), everysec (fsync once per second, default, lose at most ~1 second of writes), no (let the OS decide, fastest, can lose ~30 seconds on a Linux default page-cache flush). 
Production wisdom: run both RDB and AOF. AOF gives you the second-grained durability; RDB gives you the fast restart (a 50 GB AOF replay takes minutes, the equivalent RDB load takes seconds). On startup Redis prefers AOF if it exists (more current); RDB is the fallback. The classic failure mode this prevents: a Diwali-night flash sale on an Indian e-commerce site, a single Redis box holding 100 K active sessions, an OOM-killer crash at 2 AM — without persistence every shopper logs out mid-checkout; with appendfsync everysec the worst case is one second of session writes lost and nobody notices.
The previous chapter framed Redis as an in-memory database whose product is its data structures. The "in-memory" half of that framing is what makes Redis fast — every read is a hash lookup followed by a structure operation, no disk anywhere on the hot path. The "in-memory" half is also what makes Redis dangerous: pull the power cord and the entire dataset is gone the instant the kernel reaps the process. If your Redis is genuinely a cache in front of Postgres — every key derivable from the source of truth on a slow miss — that is fine. But the moment Redis holds anything that only lives in Redis (a session, a leaderboard, a delayed-job queue, a rate-limit counter, a streaming consumer's last-acked ID), losing it on restart is a real outage. Persistence is the discipline of writing enough state to disk that a restart can reconstruct the keyspace.
Redis offers two persistence mechanisms — RDB (snapshots) and AOF (append-only file) — with very different trade-offs. They are not mutually exclusive; the production-recommended setup runs both. This chapter walks the mechanics, the fsync knob that sets the real durability budget, the BGREWRITEAOF compaction that keeps AOF files from growing without bound, and the combined setup an Indian fintech actually uses for session storage on a Diwali night.
RDB: a periodic full snapshot of memory
The simplest possible persistence policy is "every so often, write the whole dataset to a file". RDB is exactly that. The Redis server keeps the keyspace in RAM and, on a configurable trigger, writes a point-in-time binary snapshot of every key to a file called dump.rdb (the name is configurable; the default location is the working directory). On restart, the server loads dump.rdb and you are back where you were at the moment of the snapshot — minus everything that was written after.
The trigger is configured in redis.conf with one or more save <seconds> <changes> lines. The defaults read like this:
save 3600 1 # snapshot if 1 change in the last hour
save 300 100 # snapshot if 100 changes in the last 5 minutes
save 60 10000 # snapshot if 10 000 changes in the last minute
Any line that matches triggers a snapshot — so a write-heavy workload snapshots every minute, a quieter one every five minutes or every hour, and a completely idle one (zero changes) never snapshots at all. You can disable RDB entirely with save "" and you can force an immediate snapshot with the BGSAVE command (background save) or SAVE (foreground, blocking — almost never the right call in production).
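When scripting snapshots, the useful pattern is to fire BGSAVE and poll LASTSAVE until it advances, because BGSAVE itself returns immediately. A minimal sketch, assuming a redis-py-style client (force_snapshot is a hypothetical helper, not a Redis API):

```python
import time

def force_snapshot(r, timeout=30.0):
    """Trigger a background snapshot and wait for it to finish.

    LASTSAVE returns the time of the last successful RDB save, so
    watching it advance is how we know the forked child completed.
    """
    before = r.lastsave()           # timestamp of the previous snapshot
    r.bgsave()                      # fork(); the child writes dump.rdb
    deadline = time.time() + timeout
    while time.time() < deadline:
        if r.lastsave() != before:  # LASTSAVE advanced: snapshot done
            return True
        time.sleep(0.1)
    return False                    # child did not finish in time
```

Against a live instance this would be called as force_snapshot(redis.Redis()); the poll, not the BGSAVE reply, is what tells you the snapshot is actually on disk.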
The mechanics are worth dwelling on because they explain both the magic and the gotchas. When BGSAVE fires, the parent calls fork(). On Linux (and every other Unix), fork() does not copy the parent's 5 GB of memory; it copies the page tables — a small per-page lookup structure — and marks every page copy-on-write. Both processes now see the same physical pages. The child immediately starts walking the keyspace and serialising every key to a binary file. Meanwhile the parent keeps serving every command at full speed. When the parent writes (because a client called SET or INCR), the kernel intercepts the page fault, allocates a fresh physical page, copies the original 4 KB into it, and lets the parent modify the copy. The child still sees the original — which is exactly what we want, because the snapshot must be consistent at the moment of the fork.
The cost of COW scales with the write rate during the snapshot, not with the dataset size. A 5 GB Redis with no writes during a 30-second snapshot uses zero extra RAM; a 5 GB Redis with heavy churn during the snapshot might use 10–20 % extra. The classic operational trap is provisioning a Redis box at exactly 50 % of host RAM and having the snapshot OOM the host — the rule of thumb is "leave at least the size of your peak write churn during a snapshot as headroom", and in practice maxmemory should be set to 60–70 % of host RAM if you want safe RDB snapshots.
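That rule of thumb can be written down as arithmetic. A back-of-envelope sketch; the 1 GB OS allowance and the 70 % cap are assumptions taken from this chapter's guidance, not Redis parameters:

```python
def safe_maxmemory(host_ram_gb, peak_write_churn_gb):
    """Back-of-envelope maxmemory (GB) for safe BGSAVE under COW.

    Leave room for the OS plus the pages the parent will duplicate
    while the snapshot child runs: the COW cost scales with write
    churn during the snapshot, not with dataset size.
    """
    os_overhead_gb = 1.0  # assumption: ~1 GB for kernel, buffers, etc.
    budget = host_ram_gb - os_overhead_gb - peak_write_churn_gb
    # The chapter's rule of thumb: never above 60-70 % of host RAM
    return min(budget, 0.7 * host_ram_gb)

# e.g. a 32 GB host with ~4 GB of snapshot-window churn: ~22.4 GB
```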
The RDB binary format itself is dense and clever: every key is preceded by a one-byte type tag (0 = string, 1 = list, 2 = set, 3 = sorted set, 4 = hash, plus extra tags for the compact small-value encodings), values are length-prefixed, integers and short strings are encoded specially, and the whole stream is optionally LZF-compressed. The result is typically 5–10× smaller than the in-memory representation (the in-memory version pays for hash-table overhead, pointer chasing, and 8-byte alignment). On restart, Redis streams the file from disk straight into the typed structures — a 5 GB in-memory keyspace dumped to a 500 MB RDB file might load in 8–10 seconds on an SSD, which is a fast restart by any standard.
The gotcha is the one the diagram screams: between snapshots, you have nothing. If dump.rdb was last written 4 minutes ago and the box crashes now, the last 4 minutes of writes are gone. For a write-heavy workload at 50 K writes/sec, that is 12 million lost commands. For a session store, that is potentially every login since 11:56 PM logged out at 12:00 AM. RDB by itself is the right answer for caches (you can rebuild the cache from the source of truth) and for analytical/pre-computed datasets that change rarely. It is the wrong answer for anything that lives only in Redis. For that you need AOF.
AOF: append every write command to a log
AOF takes the opposite philosophy: do not snapshot, log. Every write command (SET, INCR, LPUSH, ZADD, ...) is appended to a file (appendonly.aof by default) in the same RESP wire format the client sent. Read commands (GET, ZRANGE, ...) are not logged because they do not change state. On restart, Redis spawns a fake client, replays every command in the AOF file in order, and reconstructs the exact final state the live keyspace was in just before the crash — minus, at worst, whatever was sitting in the OS page cache and had not been fsynced to disk yet.
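The replay loop is easy to picture with a toy model. A sketch that parses a RESP fragment like the one above into a plain dict, handling only SET and INCR (real Redis handles every command, expirations included):

```python
def replay_aof(data: bytes) -> dict:
    """Replay a tiny AOF fragment (RESP arrays) into a plain dict."""
    keyspace, lines, i = {}, data.split(b"\r\n"), 0
    while i < len(lines) and lines[i]:
        assert lines[i].startswith(b"*")      # *N: N bulk strings follow
        n = int(lines[i][1:]); i += 1
        args = []
        for _ in range(n):
            assert lines[i].startswith(b"$")  # $len, then the payload line
            i += 1                            # skip the length line
            args.append(lines[i]); i += 1
        cmd = args[0].upper()
        if cmd == b"SET":
            keyspace[args[1]] = args[2]
        elif cmd == b"INCR":
            keyspace[args[1]] = str(int(keyspace.get(args[1], b"0")) + 1).encode()
    return keyspace

aof = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n*2\r\n$4\r\nINCR\r\n$1\r\nc\r\n"
```

Replaying that two-command fragment yields a keyspace with foo = bar and c = 1, which is exactly the "fake client replays the log in order" mechanism in miniature.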
Three configuration choices control AOF behaviour. The first is appendonly yes (it's no by default in stock redis.conf). The second is the file name (appendfilename "appendonly.aof"). The third — the one that actually matters — is appendfsync, which decides how often the OS is told to flush the page cache to the underlying disk.
The trade-off here is the heart of every persistent system, not just Redis. Writing to a file with write(2) is fast because it just copies bytes into the kernel's page cache; the bytes are not actually on the disk yet. fsync(2) forces the kernel to flush every dirty page for that file to the physical storage and waits for the storage to acknowledge. A modern NVMe SSD can do an fsync in 50–500 microseconds; a magnetic disk takes 5–10 milliseconds; an EBS gp3 volume on AWS takes around 1 ms; a network-attached file system can take 10+ ms. Whichever it is, fsyncing on every write turns Redis from a 100 K-ops-per-second machine into a 10 K-ops-per-second machine, because every write now waits for the disk. The three modes:
- appendfsync always — fsync after every write command. Zero data loss on a Redis crash; you only lose what was in flight on the network. Slowest mode by 5–10×; only used by people whose definition of "data" is "money in a wallet".
- appendfsync everysec — fsync once per second, from a background thread. The default and the right answer for ~95 % of deployments. Worst-case loss is one second of writes, the foreground event loop is never blocked on disk, and the throughput penalty vs. no-fsync is single-digit percent.
- appendfsync no — never call fsync; let the OS decide when to flush. Linux's page cache typically holds dirty pages for ~30 seconds before flushing under default tunings (vm.dirty_expire_centisecs). Worst-case loss is ~30 seconds. Used only when you really do not care about losing recent writes.
Why everysec is the default: it picks the elbow of the throughput-vs-durability curve. Going from no to everysec costs you almost nothing in throughput (the fsync runs in a background thread, the foreground loop never blocks) but caps your loss window at 1 second instead of 30. Going from everysec to always costs you an order of magnitude in throughput to gain that one second back. For 99 % of workloads, "lose at most 1 second of writes on a crash" is the right answer.
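You can feel the write-vs-fsync gap without Redis at all. A sketch that times appends to a temp file with and without a per-write fsync; the absolute numbers depend entirely on your storage:

```python
import os
import tempfile
import time

def fsync_cost(n_writes=200, payload=b"x" * 64, fsync_every=False):
    """Time n appends to a temp file, optionally fsyncing each one.

    write(2) only copies bytes into the kernel page cache; fsync(2)
    waits for the storage device to acknowledge, which is the whole
    difference between appendfsync always and the other modes.
    """
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, payload)
            if fsync_every:
                os.fsync(fd)   # always-mode: wait for the device every write
        if not fsync_every:
            os.fsync(fd)       # one fsync at the end (everysec-like batching)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

# On most machines fsync_every=True is dramatically slower per write.
```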
BGREWRITEAOF: keeping the log from eating the disk
The obvious problem with logging every command is that the log grows without bound. A counter that has been incremented ten million times sits in memory as a single integer key but has ten million INCR lines in the AOF. A list that has been pushed and popped a billion times in a queue workload may currently hold three items but has two billion lines in its history. After a few weeks of running, the AOF can be ten or a hundred times the size of the keyspace, and every restart has to chew through all of it.
BGREWRITEAOF fixes this. The command — which can be triggered manually or, more usefully, automatically when the file grows past a configured threshold — does the same fork() trick that RDB does: spawn a child, walk the current in-memory keyspace, and write the shortest equivalent script to a fresh file. The counter that was incremented ten million times becomes one line: SET counter 10000000. The list that was churned through becomes a single RPUSH of its current contents. Everything that has been deleted is simply absent. The new file is the smallest AOF that, replayed from scratch, reconstructs the current keyspace.
While the child is writing the rewrite, the parent keeps serving traffic and also keeps appending new commands to a buffer (the rewrite buffer). When the child finishes, the parent appends the buffer to the rewritten file and atomically renames it over the old one. From the client's perspective nothing happens; from the disk's perspective the file just shrunk by 10× or 100×.
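The rewrite itself is conceptually tiny. A toy sketch of the "shortest equivalent script" idea for string keys only; real BGREWRITEAOF handles every data type plus the rewrite buffer:

```python
def rewrite_aof(keyspace: dict) -> bytes:
    """Emit the shortest RESP script that reconstructs `keyspace`.

    The BGREWRITEAOF idea: ignore the history entirely, walk the
    live state, and write one SET per surviving key. Deleted keys
    are simply absent from the output.
    """
    out = []
    for key, value in keyspace.items():
        args = [b"SET", key, value]
        out.append(b"*%d\r\n" % len(args) +
                   b"".join(b"$%d\r\n%s\r\n" % (len(a), a) for a in args))
    return b"".join(out)

# Ten million INCRs collapse to one command:
# rewrite_aof({b"counter": b"10000000"})
```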
The automatic trigger is configured by two lines:
auto-aof-rewrite-percentage 100 # rewrite when AOF doubles in size
auto-aof-rewrite-min-size 64mb # but never below 64 MB
Defaults: rewrite when the AOF is at least 100 % larger than it was after the last rewrite, but only if the file is at least 64 MB (so a small Redis does not rewrite constantly). For a typical workload this means the AOF settles into a sawtooth: it grows during normal operation, then shrinks back to roughly "size of the in-memory keyspace expressed as RESP commands" after each rewrite.
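The trigger logic is simple enough to mirror exactly. A sketch of the decision (should_rewrite is an illustrative helper, not a Redis API):

```python
def should_rewrite(current_size, size_after_last_rewrite,
                   percentage=100, min_size=64 * 1024 * 1024):
    """Mirror the auto-aof-rewrite trigger: rewrite when the file has
    grown by `percentage` percent over the post-rewrite baseline, but
    never while it is still below the absolute `min_size` floor."""
    if current_size < min_size:
        return False   # small files rewrite too often to be worth it
    growth = (current_size - size_after_last_rewrite) * 100 // size_after_last_rewrite
    return growth >= percentage
```

With the defaults, a file that settled at 64 MB after its last rewrite triggers again around 128 MB, which is the sawtooth described above.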
Modern Redis (7.x and later) enhances this with the multi-part AOF layout: the rewritten "base" snapshot lives as a binary RDB file (appendonly.aof.1.base.rdb), and only the commands appended since the last rewrite live as RESP text (appendonly.aof.1.incr.aof). On replay, Redis loads the base snapshot fast and then replays only the incremental log — closing the gap between AOF's slow restart and RDB's fast one. Why this matters: in old Redis, even a freshly-rewritten AOF was still RESP text that had to be parsed and re-executed command by command. The 7.x manifest layout makes the base load identical to an RDB load (sequential read into typed structures) and only the last few seconds of writes have to go through the slow RESP-replay path. A 50 GB instance that used to take 5 minutes to recover now takes ~30 seconds.
Combined RDB + AOF: the production-recommended setup
Neither RDB nor AOF alone is right for production. RDB by itself loses minutes of data on a crash. AOF by itself takes painfully long to replay on restart for any large keyspace. The recommended setup is to run both: AOF gives you the fine-grained durability (the second-grained loss budget); RDB gives you the fast restart (and a convenient backup format you can copy to S3).
A typical production redis.conf for a session-storage workload looks like this:
# RDB: snapshot every 5 min if 100 keys changed (also nightly backup target)
save 300 100
save 60 10000
dbfilename dump.rdb
dir /var/lib/redis
# AOF: enabled, default fsync policy
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
# Auto-rewrite when AOF doubles, with a 64 MB floor
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Keep fsyncing even while a BGREWRITEAOF child is running (safer; may add latency)
no-appendfsync-on-rewrite no
aof-load-truncated yes
The aof-load-truncated yes line is worth a callout: if the AOF was being written when the box crashed, the last command might be half-written. Setting this to yes tells Redis to silently truncate the partial command and load everything before it (losing at most one command). Setting it to no makes Redis refuse to start until you fix the file by hand, which is rarely what you want at 3 AM.
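A config file can drift from what the running server is actually doing, so it is worth checking INFO persistence programmatically. A sketch that inspects the dict redis-py returns from r.info('persistence'); persistence_healthy itself is a hypothetical helper, and the checks assume the standard INFO field names:

```python
def persistence_healthy(info: dict) -> list:
    """Scan an INFO persistence section for the failure modes above.

    `info` is the dict from a redis-py r.info('persistence') call;
    returns a list of problems (empty list means healthy).
    """
    problems = []
    if not info.get("aof_enabled"):
        problems.append("AOF disabled: a crash loses everything since the last RDB")
    if info.get("rdb_last_bgsave_status") != "ok":
        problems.append("last BGSAVE failed: check disk space and COW headroom")
    if info.get("aof_last_write_status") != "ok":
        problems.append("AOF writes failing: Redis may start rejecting writes")
    return problems
```

Wiring this into a one-minute cron against production is cheap insurance against the silent "AOF has been failing for a week" scenario.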
A worked example: an Indian fintech and the 2 AM OOM crash
Redis as a session store on Diwali night
You are the SRE on call for paisa.in, an Indian fintech offering UPI payments and instant credit. The login flow stores each user's session in Redis: a HASH per session ID containing user_id, csrf_token, 2fa_passed_at, last_active. The session TTL is 30 minutes; each user touches their session ~3 times per minute while active. On a typical day you have 100 K concurrent sessions, peaking to 250 K on the 1st and 7th of every month (salary-credit days).
Diwali night, 2:13 AM. The box is at 14 GB RAM, maxmemory is 16 GB, the kernel's OOM killer fires on something else and Redis gets caught in the cascade. The process dies. The Redis box restarts in 8 seconds. What happens to your sessions depends entirely on your persistence config.
Scenario A: no persistence (save "" and appendonly no).
# Restart sequence:
redis_server.start()
db_size = redis.dbsize() # 0 — every key gone
Every one of 180 K active sessions (it's a long weekend) is invalidated. Every user trying to make a UPI payment hits "session expired", gets bounced to the login page, has to re-enter their UPI PIN. Customer-support tickets spike to 200/min. The CEO calls. You spend the next two hours apologising on Twitter.
Scenario B: RDB only (save 300 100, appendonly no).
redis_server.start()
db_size = redis.dbsize() # ~165 K — sessions that existed at last snapshot
# but: the snapshot was at 2:08 AM — five minutes ago
# every login, every refreshed session, every 2FA token validated
# between 2:08 and 2:13 is gone
About 15 K users — anyone who logged in or touched their session in the last 5 minutes — are logged out mid-action. UPI payments mid-flight at the moment of the crash fail because their 2fa_passed_at field reverted to a value before they completed 2FA. Better than scenario A, still bad.
Scenario C: AOF with appendfsync everysec (recommended).
redis_server.start()
# Restart loads dump.rdb (if RDB+AOF combined) or replays AOF
# In hybrid 7.x: base RDB loads in 4 s, incremental tail replays in 80 ms
db_size = redis.dbsize() # ~180 K — every session except the last second of writes
The worst that can have happened is that the writes from 2:13:00.500 to 2:13:01.000 were sitting in the kernel page cache and not yet fsynced. That is, at 50 K writes/sec across all sessions, ~25 K writes lost — but for a session HASH, "lost write" means "the session's last_active field is one second stale", not "the session is gone". No user notices. UPI payments in flight retry idempotently and succeed. Customer-support tickets are flat. The CEO does not call.
The implementation is exactly two lines in redis.conf:
appendonly yes
appendfsync everysec
plus a once-a-week verification:
import subprocess, time
import redis

r = redis.Redis()

def verify_persistence_works():
    """Back up the AOF, kill -9 redis, restart, check dbsize survived."""
    subprocess.run(['cp', '/var/lib/redis/appendonly.aof',
                    f'/tmp/aof.{int(time.time())}'], check=True)
    pre_count = r.dbsize()
    subprocess.run(['systemctl', 'kill', '-s', 'KILL', 'redis'], check=True)
    time.sleep(2)
    subprocess.run(['systemctl', 'start', 'redis'], check=True)
    time.sleep(15)  # give Redis time to replay the AOF
    post_count = r.dbsize()
    assert post_count >= pre_count - 100, f'Lost too many keys: {pre_count} -> {post_count}'
    return post_count
Run that as a synthetic test in staging once a week. The first time it catches a misconfigured appendfsync no line in production, it pays for itself.
The lesson generalises beyond fintech sessions. Anywhere Redis is the source of truth — leaderboards in a gaming app, delayed jobs in a queue system, OTP-rate-limit counters during festival-traffic spikes — the choice between "snapshot every 5 minutes" and "log every write with 1-second fsync" is the choice between "this outage costs us money" and "this outage is invisible to users".
Restart speed: a back-of-the-envelope
The two persistence shapes have very different restart costs, and the cost matters because it sets the upper bound on your downtime during a planned restart, a node replacement, or a failover. Pseudocode for the restart sequence makes the difference concrete:
def redis_startup():
    # Modern Redis 7.x — multi-part AOF
    if has_aof():
        # 1. Load the base RDB (sequential read into typed structures)
        load_rdb_file('appendonly.aof.1.base.rdb')    # fast sequential binary load
        # 2. Replay the incremental RESP tail
        replay_aof_file('appendonly.aof.1.incr.aof')  # ~500 K cmds/sec
    elif has_rdb():
        # Legacy: just RDB
        load_rdb_file('dump.rdb')  # fast — typed binary
    else:
        # No persistence — start with an empty keyspace
        pass
    accept_clients()
The numbers that matter on commodity hardware (NVMe SSD, modern CPU):
| Format / scenario | 50 GB in-memory keyspace |
|---|---|
| RDB load (5 GB on disk after compression) | ~30 seconds |
| Pure AOF replay (200 GB log, 500 M commands) | ~17 minutes |
| Hybrid AOF (5 GB base RDB + 1 GB incr tail = 5 M cmds) | ~50 seconds |
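The table rows all come from two throughput figures, so the estimate generalises. A sketch with assumed rates (~170 MB/s sequential RDB load, ~500 K replayed commands/sec, both entirely hardware-dependent):

```python
def restart_estimate(rdb_bytes=0, aof_commands=0,
                     rdb_bytes_per_sec=170 * 2**20, cmds_per_sec=500_000):
    """Seconds to restart: sequential RDB load plus RESP command replay."""
    return rdb_bytes / rdb_bytes_per_sec + aof_commands / cmds_per_sec

# Pure AOF, 500 M commands: restart_estimate(aof_commands=500_000_000) -> 1000 s
# Hybrid, 5 GB base + 5 M command tail: roughly 40 s by this estimate
```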
A pure AOF restart on a large instance is the operational nightmare that pushed Redis 7 to introduce the hybrid format. Before 7.x, the running joke was "AOF is for the day you forget to fsync; RDB is for the day you have to actually restart". Now, with the multi-part layout (which builds on aof-use-rdb-preamble yes, the default since Redis 4), you genuinely get both.
Going deeper: persistence in a replicated world
The story above treats a single Redis box. Production Redis is almost always a primary with one or more replicas, and the persistence story changes shape when replication is in the picture. The full discussion belongs in chapter 171 (replication, Sentinel, and Cluster), but two interactions are worth flagging here.
Diskless replication and the snapshot you do not need
When a fresh replica connects to a primary for the first time (or after a long disconnect), the primary needs to send a complete copy of its keyspace. The classic mechanism is exactly RDB: the primary forks, the child writes an RDB file to disk, and then the primary streams the file over the socket to the replica. Diskless replication (repl-diskless-sync yes, the default in modern Redis) skips the disk: the primary forks and the child writes the RDB stream directly to the replica's socket. No disk I/O, faster sync, useful when the disk is slow but the network is fast. The replica then enables AOF (or not) on its own copy according to its own config.
Persistence on the replica vs. the primary
A common operational shape is AOF on the replica, RDB only on the primary. The primary stays as fast as possible (no fsync overhead on the hot path), the replica eats the durability cost, and if the primary dies you fail over to the replica, which has the full AOF. The risk: there is a small replication-lag window where the primary acked a write and the replica had not yet received it. If the primary dies in that window and you fail over, the write is genuinely lost. Whether you can tolerate that is a product question.
WAIT N timeout for synchronous replication
Redis offers a partial answer to the replication-loss problem: issuing WAIT 1 100 after a write blocks the client until at least 1 replica has acknowledged the write or a 100 ms timeout expires; the reply is the number of replicas that actually acked, so the client can detect the slow path. If the primary dies and you fail over to a replica that acked, the write survives. This is "synchronous replication for the writes that matter", roughly the Redis equivalent of Postgres's synchronous_commit = remote_apply. Using it costs you a network round trip per write, so most workloads use it only for the few critical commands (UPI debit, password change) and not for everything.
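In application code this is a thin wrapper around the critical writes. A sketch assuming a redis-py-style client (critical_write is a hypothetical helper):

```python
def critical_write(r, key, value, replicas=1, timeout_ms=100):
    """SET, then WAIT for `replicas` replicas to ack (or time out).

    `r` is assumed to be a redis-py-style client. WAIT returns the
    number of replicas that acknowledged within the timeout, so a
    False return means the write may exist only on the primary and
    the caller should retry, alert, or fall back.
    """
    r.set(key, value)
    acked = r.execute_command("WAIT", replicas, timeout_ms)
    return acked >= replicas
```

Reserve it for the UPI-debit class of writes; wrapping every SET this way turns the whole workload into a round-trip-bound one.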
What to take away
Persistence is the discipline that turns Redis from "a fast cache" into "a fast database". Two mechanisms, two shapes, three configuration knobs that matter:
- RDB = periodic full snapshot: fork + COW writes a dense binary dump.rdb, fast restart, lose minutes of data between snapshots.
- AOF = log every write in RESP, replay on restart, lose at most one fsync interval (typically 1 second), bigger file, slower replay.
- The fsync knob (appendfsync always | everysec | no) sets your real durability budget. everysec is the elbow of the curve and the right answer for ~95 % of deployments.
- BGREWRITEAOF keeps the AOF from growing without bound by walking the live keyspace and writing the shortest equivalent script.
- Run both in production: AOF for second-grained durability, RDB for fast restart and easy backups (copy dump.rdb to S3 nightly). Modern Redis fuses them into a hybrid format that gives you AOF's freshness with RDB's load speed.
The next chapter scales the picture out from one Redis box to a fleet: replication for read scale-out and failover, Sentinel for automatic primary election, and Cluster for sharding the keyspace across many primaries when one box's RAM is no longer enough.
References
- Redis Persistence — official documentation — the canonical reference, covers RDB, AOF, and the hybrid setup with current defaults.
- Salvatore Sanfilippo (antirez), "Redis persistence demystified" — the original blog post by Redis's creator explaining RDB and AOF trade-offs in detail.
- Redis AOF rewrite — multi-part AOF design — the Redis 7.0 PR that introduced the manifest-based hybrid AOF (base RDB + incremental RESP).
- fsync(2) — Linux man page — the syscall every persistent system depends on; understanding its cost explains the appendfsync knob.
- Redis Replication — companion documentation covering diskless sync and the WAIT command for partial synchronous replication.
- Aphyr, "Jepsen: Redis" — an external durability analysis showing what can and cannot be guaranteed even with AOF and WAIT; required reading before betting money on Redis.