In short
Redis keeps the entire dataset in RAM, so a crash, a kernel panic, or a careless kill -9 loses everything that has not been written to disk. Persistence is the set of mechanisms that let a restarted Redis reconstruct its keyspace, and there are two: RDB and AOF. RDB (Redis DataBase) takes a periodic point-in-time snapshot of the whole memory image to a binary file (dump.rdb). When the snapshot trigger fires — by default, "5 minutes have passed and at least 100 keys changed", configurable in redis.conf as save 300 100 — the parent process calls fork(), the kernel copies the page tables (not the data — copy-on-write), the child walks the keyspace and writes a tightly packed binary dump, then exits. The parent keeps serving every command at full speed during the snapshot. RDB files are small (often 5–10× smaller than the in-memory representation), restart is fast (one sequential disk read into typed structures), but everything written between snapshots is lost on a crash — typically minutes of data. AOF (Append-Only File) takes the opposite shape: every write command is appended to a log file in the RESP protocol as it executes (*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n). Restart replays the log command-for-command, reconstructing the exact final state. The file is always within one fsync interval of current, so durability is much better — but the file is bigger (every operation, not the final state) and restart is slower (replaying ten million commands takes minutes). AOF gets a periodic BGREWRITEAOF that compacts the log: a child fork walks the current memory and writes the shortest equivalent script (SET counter 1247 instead of 1247 INCRs), then atomically renames the new file over the old. The real durability knob is appendfsync: always (fsync every write, slowest, zero loss), everysec (fsync once per second, default, lose at most ~1 second of writes), no (let the OS decide, fastest, can lose ~30 seconds on a Linux default page-cache flush). 
Production wisdom: run both RDB and AOF. AOF gives you the second-grained durability; RDB gives you the fast restart (a 50 GB AOF replay takes minutes, the equivalent RDB load takes seconds). On startup Redis prefers AOF if it exists (more current); RDB is the fallback. The classic failure mode this prevents: a Diwali-night flash sale on an Indian e-commerce site, a single Redis box holding 100 K active sessions, an OOM-killer crash at 2 AM — without persistence every shopper logs out mid-checkout; with appendfsync everysec the worst case is one second of session writes lost and nobody notices.
The previous chapter framed Redis as an in-memory database whose product is its data structures. The "in-memory" half of that framing is what makes Redis fast — every read is a hash lookup followed by a structure operation, no disk anywhere on the hot path. The "in-memory" half is also what makes Redis dangerous: pull the power cord and the entire dataset is gone the instant the kernel reaps the process. If your Redis is genuinely a cache in front of Postgres — every key derivable from the source of truth on a slow miss — that is fine. But the moment Redis holds anything that only lives in Redis (a session, a leaderboard, a delayed-job queue, a rate-limit counter, a streaming consumer's last-acked ID), losing it on restart is a real outage. Persistence is the discipline of writing enough state to disk that a restart can reconstruct the keyspace.
Redis offers two persistence mechanisms — RDB (snapshots) and AOF (append-only file) — with very different trade-offs. They are not mutually exclusive; the production-recommended setup runs both. This chapter walks the mechanics, the fsync knob that sets the real durability budget, the BGREWRITEAOF compaction that keeps AOF files from growing without bound, and the combined setup an Indian fintech actually uses for session storage on a Diwali night.
RDB: a periodic full snapshot of memory
The simplest possible persistence policy is "every so often, write the whole dataset to a file". RDB is exactly that. The Redis server keeps the keyspace in RAM and, on a configurable trigger, writes a point-in-time binary snapshot of every key to a file called dump.rdb (the name is configurable; the default location is the working directory). On restart, the server loads dump.rdb and you are back where you were at the moment of the snapshot — minus everything that was written after.
The trigger is configured in redis.conf with one or more save <seconds> <changes> lines. The defaults read like this:
save 3600 1 # snapshot if 1 change in the last hour
save 300 100 # snapshot if 100 changes in the last 5 minutes
save 60 10000 # snapshot if 10 000 changes in the last minute
Any line that matches triggers a snapshot — so a write-heavy workload snapshots every minute, a quieter one every five minutes or every hour, and a completely idle one (zero changes) never snapshots at all. You can disable RDB entirely with save "" and you can force an immediate snapshot with the BGSAVE command (background save) or SAVE (foreground, blocking — almost never the right call in production).
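When scripting snapshots, the useful pattern is to fire BGSAVE and poll LASTSAVE until it advances, because BGSAVE itself returns immediately. A minimal sketch, assuming a redis-py-style client (force_snapshot is a hypothetical helper, not a Redis API):

```python
import time

def force_snapshot(r, timeout=30.0):
    """Trigger a background snapshot and wait for it to finish.

    LASTSAVE returns the time of the last successful RDB save, so
    watching it advance is how we know the forked child completed.
    """
    before = r.lastsave()           # timestamp of the previous snapshot
    r.bgsave()                      # fork(); the child writes dump.rdb
    deadline = time.time() + timeout
    while time.time() < deadline:
        if r.lastsave() != before:  # LASTSAVE advanced: snapshot done
            return True
        time.sleep(0.1)
    return False                    # child did not finish in time
```

Against a live instance this would be called as force_snapshot(redis.Redis()); the poll, not the BGSAVE reply, is what tells you the snapshot is actually on disk.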
The mechanics are worth dwelling on because they explain both the magic and the gotchas. When BGSAVE fires, the parent calls fork(). On Linux (and every other Unix), fork() does not copy the parent's 5 GB of memory; it copies the page tables — a small per-page lookup structure — and marks every page copy-on-write. Both processes now see the same physical pages. The child immediately starts walking the keyspace and serialising every key to a binary file. Meanwhile the parent keeps serving every command at full speed. When the parent writes (because a client called SET or INCR), the kernel intercepts the page fault, allocates a fresh physical page, copies the original 4 KB into it, and lets the parent modify the copy. The child still sees the original — which is exactly what we want, because the snapshot must be consistent at the moment of the fork.
The cost of COW scales with the write rate during the snapshot, not with the dataset size. A 5 GB Redis with no writes during a 30-second snapshot uses zero extra RAM; a 5 GB Redis with heavy churn during the snapshot might use 10–20 % extra. The classic operational trap is provisioning a Redis box at exactly 50 % of host RAM and having the snapshot OOM the host — the rule of thumb is "leave at least the size of your peak write churn during a snapshot as headroom", and in practice maxmemory should be set to 60–70 % of host RAM if you want safe RDB snapshots.
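That rule of thumb can be written down as arithmetic. A back-of-envelope sketch; the 1 GB OS allowance and the 70 % cap are assumptions taken from this chapter's guidance, not Redis parameters:

```python
def safe_maxmemory(host_ram_gb, peak_write_churn_gb):
    """Back-of-envelope maxmemory (GB) for safe BGSAVE under COW.

    Leave room for the OS plus the pages the parent will duplicate
    while the snapshot child runs: the COW cost scales with write
    churn during the snapshot, not with dataset size.
    """
    os_overhead_gb = 1.0  # assumption: ~1 GB for kernel, buffers, etc.
    budget = host_ram_gb - os_overhead_gb - peak_write_churn_gb
    # The chapter's rule of thumb: never above 60-70 % of host RAM
    return min(budget, 0.7 * host_ram_gb)

# e.g. a 32 GB host with ~4 GB of snapshot-window churn: ~22.4 GB
```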
The RDB binary format itself is dense and clever: every key is preceded by a one-byte type tag (0 = string, 1 = list, 2 = set, 3 = sorted set, 4 = hash, plus extra tags for the compact small-value encodings), values are length-prefixed, integers and short strings are encoded specially, and the whole stream is optionally LZF-compressed. The result is typically 5–10× smaller than the in-memory representation (the in-memory version pays for hash-table overhead, pointer chasing, and 8-byte alignment). On restart, Redis streams the file from disk straight into the typed structures — a 5 GB in-memory keyspace dumped to a 500 MB RDB file might load in 8–10 seconds on an SSD, which is a fast restart by any standard.
The gotcha is the one the diagram screams: between snapshots, you have nothing. If dump.rdb was last written 4 minutes ago and the box crashes now, the last 4 minutes of writes are gone. For a write-heavy workload at 50 K writes/sec, that is 12 million lost commands. For a session store, that is potentially every login since 11:56 PM logged out at 12:00 AM. RDB by itself is the right answer for caches (you can rebuild the cache from the source of truth) and for analytical/pre-computed datasets that change rarely. It is the wrong answer for anything that lives only in Redis. For that you need AOF.
AOF: append every write command to a log
AOF takes the opposite philosophy: do not snapshot, log. Every write command (SET, INCR, LPUSH, ZADD, ...) is appended to a file (appendonly.aof by default) in the same RESP wire format the client sent. Read commands (GET, ZRANGE, ...) are not logged because they do not change state. On restart, Redis spawns a fake client, replays every command in the AOF file in order, and reconstructs the exact final state the live keyspace was in just before the crash — minus, at worst, whatever was sitting in the OS page cache and had not been fsynced to disk yet.
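The replay loop is easy to picture with a toy model. A sketch that parses a RESP fragment like the one above into a plain dict, handling only SET and INCR (real Redis handles every command, expirations included):

```python
def replay_aof(data: bytes) -> dict:
    """Replay a tiny AOF fragment (RESP arrays) into a plain dict."""
    keyspace, lines, i = {}, data.split(b"\r\n"), 0
    while i < len(lines) and lines[i]:
        assert lines[i].startswith(b"*")      # *N: N bulk strings follow
        n = int(lines[i][1:]); i += 1
        args = []
        for _ in range(n):
            assert lines[i].startswith(b"$")  # $len, then the payload line
            i += 1                            # skip the length line
            args.append(lines[i]); i += 1
        cmd = args[0].upper()
        if cmd == b"SET":
            keyspace[args[1]] = args[2]
        elif cmd == b"INCR":
            keyspace[args[1]] = str(int(keyspace.get(args[1], b"0")) + 1).encode()
    return keyspace

aof = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n*2\r\n$4\r\nINCR\r\n$1\r\nc\r\n"
```

Replaying that two-command fragment yields a keyspace with foo = bar and c = 1, which is exactly the "fake client replays the log in order" mechanism in miniature.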
Three configuration choices control AOF behaviour. The first is appendonly yes (it's no by default in stock redis.conf). The second is the file name (appendfilename "appendonly.aof"). The third — the one that actually matters — is appendfsync, which decides how often the OS is told to flush the page cache to the underlying disk.
The trade-off here is the heart of every persistent system, not just Redis. Writing to a file with write(2) is fast because it just copies bytes into the kernel's page cache; the bytes are not actually on the disk yet. fsync(2) forces the kernel to flush every dirty page for that file to the physical storage and waits for the storage to acknowledge. A modern NVMe SSD can do an fsync in 50–500 microseconds; a magnetic disk takes 5–10 milliseconds; an EBS gp3 volume on AWS takes around 1 ms; a network-attached file system can take 10+ ms. Whichever it is, fsyncing on every write turns Redis from a 100 K-ops-per-second machine into a 10 K-ops-per-second machine, because every write now waits for the disk. The three modes:
- appendfsync always — fsync after every write command. Zero data loss on a Redis crash; you only lose what was in flight on the network. Slowest mode by 5–10×; only used by people whose definition of "data" is "money in a wallet".
- appendfsync everysec — fsync once per second, from a background thread. The default and the right answer for ~95 % of deployments. Worst-case loss is one second of writes, the foreground event loop is never blocked on disk, and the throughput penalty vs. no-fsync is single-digit percent.
- appendfsync no — never call fsync; let the OS decide when to flush. Linux's page cache typically holds dirty pages for ~30 seconds before flushing under default tunings (vm.dirty_expire_centisecs). Worst-case loss is ~30 seconds. Used only when you really do not care about losing recent writes.
Why everysec is the default: it picks the elbow of the throughput-vs-durability curve. Going from no to everysec costs you almost nothing in throughput (the fsync runs in a background thread, the foreground loop never blocks) but caps your loss window at 1 second instead of 30. Going from everysec to always costs you an order of magnitude in throughput to gain that one second back. For 99 % of workloads, "lose at most 1 second of writes on a crash" is the right answer.
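You can feel the write-vs-fsync gap without Redis at all. A sketch that times appends to a temp file with and without a per-write fsync; the absolute numbers depend entirely on your storage:

```python
import os
import tempfile
import time

def fsync_cost(n_writes=200, payload=b"x" * 64, fsync_every=False):
    """Time n appends to a temp file, optionally fsyncing each one.

    write(2) only copies bytes into the kernel page cache; fsync(2)
    waits for the storage device to acknowledge, which is the whole
    difference between appendfsync always and the other modes.
    """
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, payload)
            if fsync_every:
                os.fsync(fd)   # always-mode: wait for the device every write
        if not fsync_every:
            os.fsync(fd)       # one fsync at the end (everysec-like batching)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

# On most machines fsync_every=True is dramatically slower per write.
```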
BGREWRITEAOF: keeping the log from eating the disk
The obvious problem with logging every command is that the log grows without bound. A counter that has been incremented ten million times sits in memory as a single integer key but has ten million INCR lines in the AOF. A list that has been pushed and popped a billion times in a queue workload may currently hold three items but has two billion lines in its history. After a few weeks of running, the AOF can be ten or a hundred times the size of the keyspace, and every restart has to chew through all of it.
BGREWRITEAOF fixes this. The command — which can be triggered manually or, more usefully, automatically when the file grows past a configured threshold — does the same fork() trick that RDB does: spawn a child, walk the current in-memory keyspace, and write the shortest equivalent script to a fresh file. The counter that was incremented ten million times becomes one line: SET counter 10000000. The list that was churned through becomes a single RPUSH of its current contents. Everything that has been deleted is simply absent. The new file is the smallest AOF that, replayed from scratch, reconstructs the current keyspace.
While the child is writing the rewrite, the parent keeps serving traffic and also keeps appending new commands to a buffer (the rewrite buffer). When the child finishes, the parent appends the buffer to the rewritten file and atomically renames it over the old one. From the client's perspective nothing happens; from the disk's perspective the file just shrunk by 10× or 100×.
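The rewrite itself is conceptually tiny. A toy sketch of the "shortest equivalent script" idea for string keys only; real BGREWRITEAOF handles every data type plus the rewrite buffer:

```python
def rewrite_aof(keyspace: dict) -> bytes:
    """Emit the shortest RESP script that reconstructs `keyspace`.

    The BGREWRITEAOF idea: ignore the history entirely, walk the
    live state, and write one SET per surviving key. Deleted keys
    are simply absent from the output.
    """
    out = []
    for key, value in keyspace.items():
        args = [b"SET", key, value]
        out.append(b"*%d\r\n" % len(args) +
                   b"".join(b"$%d\r\n%s\r\n" % (len(a), a) for a in args))
    return b"".join(out)

# Ten million INCRs collapse to one command:
# rewrite_aof({b"counter": b"10000000"})
```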
The automatic trigger is configured by two lines:
auto-aof-rewrite-percentage 100 # rewrite when AOF doubles in size
auto-aof-rewrite-min-size 64mb # but never below 64 MB
Defaults: rewrite when the AOF is at least 100 % larger than it was after the last rewrite, but only if the file is at least 64 MB (so a small Redis does not rewrite constantly). For a typical workload this means the AOF settles into a sawtooth: it grows during normal operation, then shrinks back to roughly "size of the in-memory keyspace expressed as RESP commands" after each rewrite.
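The trigger logic is simple enough to mirror exactly. A sketch of the decision (should_rewrite is an illustrative helper, not a Redis API):

```python
def should_rewrite(current_size, size_after_last_rewrite,
                   percentage=100, min_size=64 * 1024 * 1024):
    """Mirror the auto-aof-rewrite trigger: rewrite when the file has
    grown by `percentage` percent over the post-rewrite baseline, but
    never while it is still below the absolute `min_size` floor."""
    if current_size < min_size:
        return False   # small files rewrite too often to be worth it
    growth = (current_size - size_after_last_rewrite) * 100 // size_after_last_rewrite
    return growth >= percentage
```

With the defaults, a file that settled at 64 MB after its last rewrite triggers again around 128 MB, which is the sawtooth described above.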
Modern Redis (7.x and later) enhances this with the multi-part AOF layout: the rewritten "base" snapshot lives as a binary RDB file (appendonly.aof.1.base.rdb), and only the commands appended since the last rewrite live as RESP text (appendonly.aof.1.incr.aof). On replay, Redis loads the base snapshot fast and then replays only the incremental log — closing the gap between AOF's slow restart and RDB's fast one. Why this matters: in old Redis, even a freshly-rewritten AOF was still RESP text that had to be parsed and re-executed command by command. The 7.x manifest layout makes the base load identical to an RDB load (sequential read into typed structures) and only the last few seconds of writes have to go through the slow RESP-replay path. A 50 GB instance that used to take 5 minutes to recover now takes ~30 seconds.
Combined RDB + AOF: the production-recommended setup
Neither RDB nor AOF alone is right for production. RDB by itself loses minutes of data on a crash. AOF by itself takes painfully long to replay on restart for any large keyspace. The recommended setup is to run both: AOF gives you the fine-grained durability (the second-grained loss budget); RDB gives you the fast restart (and a convenient backup format you can copy to S3).
A typical production redis.conf for a session-storage workload looks like this:
# RDB: snapshot every 5 min if 100 keys changed (also nightly backup target)
save 300 100
save 60 10000
dbfilename dump.rdb
dir /var/lib/redis
# AOF: enabled, default fsync policy
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
# Auto-rewrite when AOF doubles, with a 64 MB floor
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Keep fsyncing even while a BGREWRITEAOF child is running (safer; may add latency)
no-appendfsync-on-rewrite no
aof-load-truncated yes
The aof-load-truncated yes line is worth a callout: if the AOF was being written when the box crashed, the last command might be half-written. Setting this to yes tells Redis to silently truncate the partial command and load everything before it (losing at most one command). Setting it to no makes Redis refuse to start until you fix the file by hand, which is rarely what you want at 3 AM.
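A config file can drift from what the running server is actually doing, so it is worth checking INFO persistence programmatically. A sketch that inspects the dict redis-py returns from r.info('persistence'); persistence_healthy itself is a hypothetical helper, and the checks assume the standard INFO field names:

```python
def persistence_healthy(info: dict) -> list:
    """Scan an INFO persistence section for the failure modes above.

    `info` is the dict from a redis-py r.info('persistence') call;
    returns a list of problems (empty list means healthy).
    """
    problems = []
    if not info.get("aof_enabled"):
        problems.append("AOF disabled: a crash loses everything since the last RDB")
    if info.get("rdb_last_bgsave_status") != "ok":
        problems.append("last BGSAVE failed: check disk space and COW headroom")
    if info.get("aof_last_write_status") != "ok":
        problems.append("AOF writes failing: Redis may start rejecting writes")
    return problems
```

Wiring this into a one-minute cron against production is cheap insurance against the silent "AOF has been failing for a week" scenario.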
A worked example: an Indian fintech and the 2 AM OOM crash
Redis as a session store on Diwali night
You are the SRE on call for paisa.in, an Indian fintech offering UPI payments and instant credit. The login flow stores each user's session in Redis: a HASH per session ID containing user_id, csrf_token, 2fa_passed_at, last_active. The session TTL is 30 minutes; each user touches their session ~3 times per minute while active. On a typical day you have 100 K concurrent sessions, peaking to 250 K on the 1st and 7th of every month (salary-credit days).
Diwali night, 2:13 AM. The box is at 14 GB RAM, maxmemory is 16 GB, the kernel's OOM killer fires on something else and Redis gets caught in the cascade. The process dies. The Redis box restarts in 8 seconds. What happens to your sessions depends entirely on your persistence config.
Scenario A: no persistence (save "" and appendonly no).
# Restart sequence:
redis_server.start()
db_size = redis.dbsize() # 0 — every key gone
Every one of 180 K active sessions (it's a long weekend) is invalidated. Every user trying to make a UPI payment hits "session expired", gets bounced to the login page, has to re-enter their UPI PIN. Customer-support tickets spike to 200/min. The CEO calls. You spend the next two hours apologising on Twitter.
Scenario B: RDB only (save 300 100, appendonly no).
redis_server.start()
db_size = redis.dbsize() # ~165 K — sessions that existed at last snapshot
# but: the snapshot was at 2:08 AM — five minutes ago
# every login, every refreshed session, every 2FA token validated
# between 2:08 and 2:13 is gone
About 15 K users — anyone who logged in or touched their session in the last 5 minutes — are logged out mid-action. UPI payments mid-flight at the moment of the crash fail because their 2fa_passed_at field reverted to a value before they completed 2FA. Better than scenario A, still bad.
Scenario C: AOF with appendfsync everysec (recommended).
redis_server.start()
# Restart loads dump.rdb (if RDB+AOF combined) or replays AOF
# In hybrid 7.x: base RDB loads in 4 s, incremental tail replays in 80 ms
db_size = redis.dbsize() # ~180 K — every session except the last second of writes
The worst that can have happened is that the writes from 2:13:00.500 to 2:13:01.000 were sitting in the kernel page cache and not yet fsynced. That is, at 50 K writes/sec across all sessions, ~25 K writes lost — but for a session HASH, "lost write" means "the session's last_active field is one second stale", not "the session is gone". No user notices. UPI payments in flight retry idempotently and succeed. Customer-support tickets are flat. The CEO does not call.
The implementation is exactly two lines in redis.conf:
appendonly yes
appendfsync everysec
plus a once-a-week verification:
import subprocess, time
import redis

r = redis.Redis()

def verify_persistence_works():
    """Back up the AOF, kill -9 redis, restart, check dbsize survived."""
    subprocess.run(['cp', '/var/lib/redis/appendonly.aof',
                    f'/tmp/aof.{int(time.time())}'], check=True)
    pre_count = r.dbsize()
    subprocess.run(['systemctl', 'kill', '-s', 'KILL', 'redis'], check=True)
    time.sleep(2)
    subprocess.run(['systemctl', 'start', 'redis'], check=True)
    time.sleep(15)  # give Redis time to replay the AOF
    post_count = r.dbsize()
    assert post_count >= pre_count - 100, f'Lost too many keys: {pre_count} -> {post_count}'
    return post_count
Run that as a synthetic test in staging once a week. The first time it catches a misconfigured appendfsync no line in production, it pays for itself.
The lesson generalises beyond fintech sessions. Anywhere Redis is the source of truth — leaderboards in a gaming app, delayed jobs in a queue system, OTP-rate-limit counters during festival-traffic spikes — the choice between "snapshot every 5 minutes" and "log every write with 1-second fsync" is the choice between "this outage costs us money" and "this outage is invisible to users".
Restart speed: a back-of-the-envelope
The two persistence shapes have very different restart costs, and the cost matters because it sets the upper bound on your downtime during a planned restart, a node replacement, or a failover. Pseudocode for the restart sequence makes the difference concrete:
def redis_startup():
    # Modern Redis 7.x — multi-part AOF
    if has_aof():
        # 1. Load the base RDB (sequential read into typed structures)
        load_rdb_file('appendonly.aof.1.base.rdb')    # fast sequential binary load
        # 2. Replay the incremental RESP tail
        replay_aof_file('appendonly.aof.1.incr.aof')  # ~500 K cmds/sec
    elif has_rdb():
        # Legacy: just RDB
        load_rdb_file('dump.rdb')  # fast — typed binary
    else:
        # No persistence — start with an empty keyspace
        pass
    accept_clients()
The numbers that matter on commodity hardware (NVMe SSD, modern CPU):
| Format / scenario | 50 GB in-memory keyspace |
|---|---|
| RDB load (5 GB on disk after compression) | ~30 seconds |
| Pure AOF replay (200 GB log, 500 M commands) | ~17 minutes |
| Hybrid AOF (5 GB base RDB + 1 GB incr tail = 5 M cmds) | ~50 seconds |
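The table rows all come from two throughput figures, so the estimate generalises. A sketch with assumed rates (~170 MB/s sequential RDB load, ~500 K replayed commands/sec, both entirely hardware-dependent):

```python
def restart_estimate(rdb_bytes=0, aof_commands=0,
                     rdb_bytes_per_sec=170 * 2**20, cmds_per_sec=500_000):
    """Seconds to restart: sequential RDB load plus RESP command replay."""
    return rdb_bytes / rdb_bytes_per_sec + aof_commands / cmds_per_sec

# Pure AOF, 500 M commands: restart_estimate(aof_commands=500_000_000) -> 1000 s
# Hybrid, 5 GB base + 5 M command tail: roughly 40 s by this estimate
```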
A pure AOF restart on a large instance is the operational nightmare that pushed Redis 7 to introduce the hybrid format. Before 7.x, the running joke was "AOF is for the day you forget to fsync; RDB is for the day you have to actually restart". Now, with the multi-part layout (which builds on aof-use-rdb-preamble yes, the default since Redis 4), you genuinely get both.
Going deeper: persistence in a replicated world
The story above treats a single Redis box. Production Redis is almost always a primary with one or more replicas, and the persistence story changes shape when replication is in the picture. The full discussion belongs in chapter 171 (replication, Sentinel, and Cluster), but two interactions are worth flagging here.
Diskless replication and the snapshot you do not need
When a fresh replica connects to a primary for the first time (or after a long disconnect), the primary needs to send a complete copy of its keyspace. The classic mechanism is exactly RDB: the primary forks, the child writes an RDB file to disk, and then the primary streams the file over the socket to the replica. Diskless replication (repl-diskless-sync yes, the default in modern Redis) skips the disk: the primary forks and the child writes the RDB stream directly to the replica's socket. No disk I/O, faster sync, useful when the disk is slow but the network is fast. The replica then enables AOF (or not) on its own copy according to its own config.
Persistence on the replica vs. the primary
A common operational shape is AOF on the replica, RDB only on the primary. The primary stays as fast as possible (no fsync overhead on the hot path), the replica eats the durability cost, and if the primary dies you fail over to the replica, which has the full AOF. The risk: there is a small replication-lag window where the primary acked a write and the replica had not yet received it. If the primary dies in that window and you fail over, the write is genuinely lost. Whether you can tolerate that is a product question.
WAIT N timeout for synchronous replication
Redis offers a partial answer to the replication-loss problem: issuing WAIT 1 100 after a write blocks the client until at least 1 replica has acknowledged the write or a 100 ms timeout expires; the reply is the number of replicas that actually acked, so the client can detect the slow path. If the primary dies and you fail over to a replica that acked, the write survives. This is "synchronous replication for the writes that matter", roughly the Redis equivalent of Postgres's synchronous_commit = remote_apply. Using it costs you a network round trip per write, so most workloads use it only for the few critical commands (UPI debit, password change) and not for everything.
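In application code this is a thin wrapper around the critical writes. A sketch assuming a redis-py-style client (critical_write is a hypothetical helper):

```python
def critical_write(r, key, value, replicas=1, timeout_ms=100):
    """SET, then WAIT for `replicas` replicas to ack (or time out).

    `r` is assumed to be a redis-py-style client. WAIT returns the
    number of replicas that acknowledged within the timeout, so a
    False return means the write may exist only on the primary and
    the caller should retry, alert, or fall back.
    """
    r.set(key, value)
    acked = r.execute_command("WAIT", replicas, timeout_ms)
    return acked >= replicas
```

Reserve it for the UPI-debit class of writes; wrapping every SET this way turns the whole workload into a round-trip-bound one.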
What to take away
Persistence is the discipline that turns Redis from "a fast cache" into "a fast database". Two mechanisms, two shapes, three configuration knobs that matter:
- RDB = periodic full snapshot: fork + COW writes a dense binary dump.rdb, fast restart, lose minutes of data between snapshots.
- AOF = log every write in RESP, replay on restart, lose at most one fsync interval (typically 1 second), bigger file, slower replay.
- The fsync knob (appendfsync always | everysec | no) sets your real durability budget. everysec is the elbow of the curve and the right answer for ~95 % of deployments.
- BGREWRITEAOF keeps the AOF from growing without bound by walking the live keyspace and writing the shortest equivalent script.
- Run both in production: AOF for second-grained durability, RDB for fast restart and easy backups (copy dump.rdb to S3 nightly). Modern Redis fuses them into a hybrid format that gives you AOF's freshness with RDB's load speed.
The next chapter scales the picture out from one Redis box to a fleet: replication for read scale-out and failover, Sentinel for automatic primary election, and Cluster for sharding the keyspace across many primaries when one box's RAM is no longer enough.
References
- Redis Persistence — official documentation — the canonical reference, covers RDB, AOF, and the hybrid setup with current defaults.
- Salvatore Sanfilippo (antirez), "Redis persistence demystified" — the original blog post by Redis's creator explaining RDB and AOF trade-offs in detail.
- Redis AOF rewrite — multi-part AOF design — the Redis 7.0 PR that introduced the manifest-based hybrid AOF (base RDB + incremental RESP).
- fsync(2) — Linux man page — the syscall every persistent system depends on; understanding its cost explains the appendfsync knob.
- Redis Replication — companion documentation covering diskless sync and the WAIT command for partial synchronous replication.
- Aphyr, "Jepsen: Redis" — an external durability analysis showing what can and cannot be guaranteed even with AOF and WAIT; required reading before betting money on Redis.