In short
Neo4j on disk is a handful of fixed-size record files plus a property store: neostore.nodestore.db (15 bytes per node — in-use flag, first-edge offset, first-property offset, label info), neostore.relationshipstore.db (34 bytes per edge — in-use, type id, src node, dst node, four prev/next pointers for the doubly linked adjacency lists, first-property offset), neostore.propertystore.db (variable-length, chained for big values), and the small token stores that intern label and relationship-type names as short integers. Fixed-size means record n lives at offset n × record_size — no index needed to find it, which is the whole basis of index-free adjacency. Above the files sits the page cache, Neo4j's buffer pool, sized by dbms.memory.pagecache.size. Every traversal first asks the page cache "do you have the page containing this record?". If yes (the goal: above **95 percent hit rate** on hot data), the answer is sub-microsecond. If no, Neo4j faults the page in from SSD, paying **5–10 ms** per page miss — thousands of times slower. Almost every "Neo4j is slow" report in production traces to a page cache too small for the working set. The fix is sizing: heap (dbms.memory.heap.max_size) plus page cache should sum to about **70 percent of system RAM**, with the page cache the larger share for read-heavy workloads. Above the cache sits the Cypher engine: parse → logical plan → cost-based optimisation (which uses index statistics to choose seeks over scans) → execution → result streaming. Four tuning moves dominate everything else: (1) size the page cache to cover the hot working set, (2) add schema indexes (CREATE INDEX FOR (u:User) ON (u.account_id)) so MATCH patterns can find their *starting* nodes in O(log N), (3) run PROFILE on slow queries to spot Cartesian products and missing index seeks, (4) use index hints (USING INDEX) when the planner picks a bad path.
The worked example tunes a fraud-detection query at an Indian fintech with 50 million UPI users from 5 seconds (default config, full data scan, Cartesian product) down to 50 ms (32 GB page cache covering the dataset, label index on User(account_id), fixed query plan) — a 100× win with zero changes to the algorithm. Real-world deployments span fraud detection at HSBC, recommendations at Walmart, and knowledge graphs at NASA, all riding on the same four primitives.
The previous chapter showed why relational engines can't keep up with graph workloads — every hop is another self-join, another B-tree probe, another 100 µs. [Two chapters back](/wiki/native-adjacency-storage-and-index-free-adjacency) showed Neo4j's answer: store the adjacency as direct file-offset pointers inside the node and relationship records, drop the per-hop cost to about 1 µs, and let it compound. That story explains why Neo4j is fast in principle. This chapter explains what makes it fast or slow in production: the on-disk storage layout in detail, the page cache that mediates every read, the Cypher query lifecycle from parse to result, and the four tuning levers that dominate every other consideration. By the end you should be able to look at a slow Neo4j query and reason about whether it's a page-cache miss, a missing index, or a planner mistake — the three failure modes that cover almost everything you'll see in the wild.
The thesis: storage layout determines the ceiling, page cache determines the floor
Two numbers run this whole chapter. The first is 1 microsecond — the time it takes to follow a relationship pointer when both the node record and the relationship record are already in RAM. The second is 5–10 milliseconds — the time it takes to fault one of those records in from SSD when it isn't. The ratio is roughly 10,000×. Why this ratio dominates everything: a fraud-ring query that visits a thousand edges takes 1 ms when every page is cached and 5–10 seconds when none of them are. Same query, same data, same code path — the difference is purely whether the bytes were in memory or on disk. Tuning Neo4j is, almost entirely, tuning that hit rate up.
The storage layout sets the ceiling — how fast a hot query can possibly be — by making record lookups O(1) file-offset multiplications and adjacency walks pure pointer chases. The page cache sets the floor — how slow a cold query has to be — by determining what fraction of the working set lives in RAM. A perfectly designed query language and a brilliant planner cannot rescue you from a 50 GB working set crammed into a 4 GB page cache; they also cannot help much if the storage layout forced you to probe an index for every hop. Neo4j gets the storage layout right by design (you cannot really mis-configure index-free adjacency); the page cache, you have to size yourself.
The on-disk layout: four record files plus property chains
Open a Neo4j data directory and you find a single subdirectory per database. Inside, ignoring transaction logs and the schema store, the files that matter are these:
neo4j/data/databases/upi-fraud/
├── neostore.nodestore.db # fixed-size node records
├── neostore.relationshipstore.db # fixed-size relationship records
├── neostore.propertystore.db # variable-length property chains
├── neostore.labeltokenstore.db # interned label names
├── neostore.relationshiptypetokenstore.db # interned relationship-type names
└── neostore.propertykeytokenstore.db # interned property-key names
Every one of those files except the property store is a sequence of fixed-size records. That is the architectural invariant. Fixed-size means record n lives at byte offset n × record_size and the file system finds it in a single offset multiplication — no B-tree, no hash table, no index. Why fixed-size is non-negotiable: index-free adjacency depends on being able to compute the address of any node or relationship from its ID alone. If records were variable-length, the engine would need a side index to map IDs to offsets, and every hop would pay an index probe — exactly the cost native graph databases were designed to avoid.
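The offset arithmetic is simple enough to sketch directly (record sizes from the text; the page math anticipates the 8 KB page cache discussed below):

```python
NODE_RECORD_SIZE = 15   # bytes per node record
REL_RECORD_SIZE = 34    # bytes per relationship record
PAGE_SIZE = 8 * 1024    # page cache page size

def node_offset(node_id: int) -> int:
    """Byte offset of a node record in neostore.nodestore.db."""
    return node_id * NODE_RECORD_SIZE

def rel_offset(rel_id: int) -> int:
    """Byte offset of a relationship record in neostore.relationshipstore.db."""
    return rel_id * REL_RECORD_SIZE

def page_of(offset: int) -> int:
    """Which 8 KB page a byte offset falls into."""
    return offset // PAGE_SIZE

# Node 42 lives at byte 630 — the same arithmetic the page-cache section uses.
assert node_offset(42) == 630
assert page_of(node_offset(42)) == 0
```

One multiplication, one division: no side index ever enters the lookup path, which is exactly the invariant fixed-size records buy.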
The node record is 15 bytes. Inside those 15 bytes Neo4j packs: a one-byte in-use flag (so deleted nodes can be reclaimed), a four-byte pointer to the first relationship in this node's adjacency list, a four-byte pointer to the first property in its property chain, a five-byte field encoding label information (compact for nodes with one or two labels, with an overflow into the dynamic label store for nodes with many), and a one-byte set of flags for things like whether this is a "dense" node that uses the relationship-group store. Reading a node is one page-aligned 15-byte read, decoded inline.
The relationship record is 34 bytes — bigger because it carries more pointers. A one-byte in-use flag. A four-byte type ID (an integer indexing into neostore.relationshiptypetokenstore.db). Two four-byte node IDs (source and destination). Four four-byte pointers: the next relationship in the source's adjacency list, the previous one in the source's list, the next one in the destination's list, the previous one in the destination's list. And a four-byte first-property pointer. The four prev/next pointers are the key — they make every relationship a member of two doubly linked lists, one threaded through each endpoint, which is what lets you traverse outward from either end in O(degree) without ever consulting an index.
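The exact bit-level encoding is an internal detail of the store format and changes between Neo4j versions; as a toy illustration of the principle only, here is a simplified 15-byte node record packed and unpacked with the fields named above (field order and widths are assumptions, not Neo4j's actual layout):

```python
import struct

# Toy layout, NOT Neo4j's real encoding: 1-byte in-use flag, 4-byte first-rel
# pointer, 4-byte first-prop pointer, 5-byte label field, 1-byte flags = 15 bytes.
NODE_FMT = ">B i i 5s B"
assert struct.calcsize(NODE_FMT) == 15

def pack_node(in_use, first_rel, first_prop, labels, flags):
    """Pack the five fields into a fixed 15-byte record."""
    return struct.pack(NODE_FMT, in_use, first_rel, first_prop, labels, flags)

def unpack_node(record: bytes):
    """Decode a 15-byte record back into named fields."""
    in_use, first_rel, first_prop, labels, flags = struct.unpack(NODE_FMT, record)
    return {"in_use": bool(in_use), "first_rel": first_rel,
            "first_prop": first_prop, "labels": labels, "flags": flags}

rec = pack_node(1, first_rel=7, first_prop=99, labels=b"\x00" * 5, flags=0)
assert len(rec) == 15
assert unpack_node(rec)["first_rel"] == 7
```

The point of the sketch: because every field has a fixed width, decoding is a single unpack with no length prefixes to parse, which is what makes record reads constant-time.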
The property store is where variable-length lives. Each property record holds a key ID (an integer indexing into neostore.propertykeytokenstore.db), a type tag, an inline value (for small ints, booleans, short strings) or a pointer to the dynamic store (for long strings, arrays, large blobs), and a pointer to the next property in the chain. Every node and every relationship has a first_property pointer; reading "all the properties of node 42" walks the linked list from there. Why properties get the linked-list treatment: properties are the variable part of any graph schema — some nodes have three, some have thirty, some have a 50 KB JSON blob. Storing them in a separate file with chaining keeps the node and relationship records fixed-size (preserving index-free adjacency) while still letting properties grow without bound. The cost is one extra page hit per property read, which is why Cypher patterns that read many properties of every node visited are slower than patterns that traverse-only.
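A property-chain read is just a linked-list walk. The sketch below models it with an in-memory dict standing in for neostore.propertystore.db (record IDs, the NO_NEXT sentinel, and the sample properties are all illustrative):

```python
NO_NEXT = -1  # end-of-chain sentinel (illustrative)

# Toy property store: record_id -> (key, value, next_record_id)
prop_store = {
    10: ("account_id", 1042, 11),
    11: ("kyc_flag", "clean", 12),
    12: ("created_at", "2023-01-05", NO_NEXT),
}

def read_properties(first_prop: int) -> dict:
    """Walk a property chain from a node's first_property pointer.
    Each hop is potentially one extra page hit, which is why
    traverse-only patterns beat property-heavy ones."""
    props, cursor = {}, first_prop
    while cursor != NO_NEXT:
        key, value, nxt = prop_store[cursor]
        props[key] = value
        cursor = nxt
    return props

assert read_properties(10) == {"account_id": 1042, "kyc_flag": "clean",
                               "created_at": "2023-01-05"}
```

Three properties, three chain hops: the cost of reading properties scales with chain length, while the node record itself stays one fixed-size read.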
The token stores are the interning trick. A relationship type like PAID is a string, but Neo4j stores it once in neostore.relationshiptypetokenstore.db with a small integer ID, and every relationship record refers to the type by that integer. This is why a 34-byte relationship record can fit a relationship-type field at all — it's a four-byte int, not a variable-length string. The same trick is used for label names and property-key names. The token stores are tiny (a few KB at most), so they live in the page cache permanently and add zero per-query cost.
The page cache: where graphs become fast or slow
Sitting on top of those files is the page cache, Neo4j's buffer pool. It is a large region of off-heap memory (so the JVM garbage collector doesn't scan it) holding 8 KB pages from the storage files. Every read and write goes through it. When the engine wants to read node 42, it computes the file offset (42 × 15 = 630), works out which 8 KB page that falls into, and asks the page cache: do you have it? If yes — a cache hit — the page is already in RAM and the read costs hundreds of nanoseconds. If no — a cache miss — Neo4j faults the page in from SSD, which costs 5–10 ms (most of which is SSD latency, the rest is OS overhead and TLB churn).
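The hit-rate arithmetic that follows from those two latencies can be made concrete (a back-of-envelope model using the chapter's numbers, not a benchmark):

```python
HIT_NS = 500          # ~hundreds of nanoseconds for a page-cache hit
MISS_NS = 7_000_000   # ~7 ms for an SSD page fault (mid-range of 5-10 ms)

def expected_read_ns(hit_ratio: float) -> float:
    """Expected cost of one record read at a given page-cache hit ratio."""
    return hit_ratio * HIT_NS + (1 - hit_ratio) * MISS_NS

# Even at 95% hits, the average read is dominated by the 5% of misses:
assert expected_read_ns(1.00) == 500
assert expected_read_ns(0.95) > 300_000      # ~0.35 ms average per read
assert expected_read_ns(0.28) > 5_000_000    # the worked example's day-zero hit rate
```

This is why the tuning targets are phrased as hit ratios rather than latencies: a few percentage points of misses move the average read cost by orders of magnitude.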
Three configuration knobs control the split. dbms.memory.heap.initial_size and dbms.memory.heap.max_size set the JVM heap — Neo4j wants these equal so the heap doesn't resize. dbms.memory.pagecache.size sets the off-heap page cache directly. The rule of thumb is: **heap + page cache ≈ 70 percent of system RAM**, with the remainder for OS, file cache, network buffers, and other processes. Within that budget, the heap holds the query planner, executor state, transaction state, and result buffers — typically 8–16 GB is enough; bigger is rarely useful and incurs longer GC pauses. The page cache gets everything else. Why the heap shouldn't be larger than necessary: the JVM's G1 garbage collector scans live objects on the heap; bigger heaps mean longer GC pauses, and a 30-second pause during a Black Friday traffic spike is exactly the kind of thing that takes down a production cluster. Off-heap memory in the page cache is invisible to the GC, which is why Neo4j puts the bulk of its working set there.
Sizing the page cache correctly means estimating the working set: the union of pages your hottest queries touch. The crude estimate is (node_count × 15) + (rel_count × 34) + property_overhead, but that's the *total* size — the *hot* working set is usually 10–30 percent of that. The empirical method is better: start with a conservative size, watch the dbms.page_cache.hit_ratio metric in production, and grow the cache until the hit rate stabilises above 95 percent. Below 95 percent, you are paying SSD latency on too many reads and your tail latency will be miserable. Above 99.9 percent, you have over-provisioned — the marginal extra RAM would be better spent elsewhere.
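Plugging the worked example's numbers into that crude estimate (50 million nodes, 8 billion relationships; property overhead is the unknown that dominates in practice, so it is left as a parameter here):

```python
def total_store_bytes(nodes: int, rels: int, property_overhead: int = 0) -> int:
    """Crude total-size estimate: fixed records plus property store."""
    return nodes * 15 + rels * 34 + property_overhead

total = total_store_bytes(nodes=50_000_000, rels=8_000_000_000)
# Fixed records alone are ~273 GB; properties push the example's dataset to ~280 GB.
assert 270e9 < total < 280e9

# The hot working set is typically 10-30% of total; the example measured ~30 GB.
hot_low, hot_high = 0.10 * total, 0.30 * total
assert hot_low < 30e9 < hot_high
```

Note how the relationship store dwarfs the node store (8 billion × 34 bytes versus 50 million × 15 bytes) — on dense transaction graphs, sizing the cache is mostly about the edges.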
The query lifecycle: parse → plan → execute
A Cypher query goes through five stages, and understanding which stage each tuning lever affects helps you reach for the right one.
Stage 1 — parse. The Cypher parser turns text into an AST. Cheap: tens to hundreds of microseconds. Neo4j caches parsed ASTs by query string, so identical queries (or queries that match a parameterised template) skip this entirely.
Stage 2 — logical planning. The planner takes the AST and produces a logical plan: which patterns to match, in which order, with what filters. This is the stage where Cartesian products get inserted if your MATCH patterns aren't connected. A query like MATCH (a:User), (b:Merchant) RETURN a, b produces a Cartesian — every user paired with every merchant — because nothing connects a and b. With 50 million users and 5 million merchants, that's 250 trillion rows, and your query never returns. The fix is always to connect the patterns: MATCH (a:User)-[:PAID]->(b:Merchant).
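The row-count arithmetic is worth making concrete (counts from the text; the throughput figure is an assumed, optimistic rate for illustration):

```python
users = 50_000_000
merchants = 5_000_000

# Disconnected MATCH (a:User), (b:Merchant): every pair becomes a row.
cartesian_rows = users * merchants
assert cartesian_rows == 250_000_000_000_000   # 250 trillion

# Even at an optimistic 10 million rows/second, that is ~290 days of work.
seconds = cartesian_rows / 10_000_000
assert seconds / 86_400 > 250
```

No amount of caching or indexing rescues a plan whose row count is quadratic in the dataset — the only fix is connecting the patterns.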
Stage 3 — cost-based optimisation. This is where indexes pay off. The planner uses statistics about each label and each indexed property to estimate the cost of each candidate plan. For MATCH (u:User {account_id: 1042}), the planner has two choices: scan every :User node and filter by account_id (cost: 50 million record reads), or use a schema index on User(account_id) to seek directly to the matching node (cost: log₂(50 million) ≈ 26 page reads). Without the index, only the first option exists. With the index, the planner picks the second automatically. Why this is the single biggest tuning lever after page cache sizing: a query that touches one node directly via an index seek touches a few hundred bytes; the same query without an index scans the entire :User label store, which on a 50-million-user dataset means scanning a couple of gigabytes. Even with a perfectly warm page cache, the second is a thousand times slower than the first.
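The seek-versus-scan gap in numbers (a sketch of the cost comparison using the chapter's figures; the ≈26 is log₂ of 50 million rounded up):

```python
import math

users = 50_000_000

# Plan A: scan every :User record and filter on account_id.
scan_reads = users                        # 50,000,000 record reads

# Plan B: B-tree seek on the User(account_id) index.
seek_reads = math.ceil(math.log2(users))  # ~26 page reads
assert seek_reads == 26

# The cost model prefers the seek by roughly six orders of magnitude.
assert scan_reads / seek_reads > 1_000_000
```

This ratio is why creating the index changes the planner's choice automatically: once the cheaper plan exists, the statistics make it the obvious winner.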
Stage 4 — execution. The execution engine walks the plan, reading records from the page cache as needed. Every node-record read, every relationship-record read, every property-chain walk goes through the cache. This is the stage that takes most of the wall-clock time, and most of that time is page-cache miss latency.
Stage 5 — stream results. Results are streamed back to the client cursor as they're produced. The result buffer lives on the JVM heap; oversized result sets pressure the heap and can trigger GC.
Two diagnostic commands matter. EXPLAIN <query> shows the plan without executing — use it to check what the planner intends to do. PROFILE <query> runs the query and annotates each plan operator with the number of rows produced and the number of database hits — use it to see what the planner actually did. The two together catch most performance bugs: EXPLAIN shows you the plan looks reasonable, then PROFILE reveals that one operator did 10 million more hits than expected because a row-count estimate was wrong.
The four tuning levers, ranked by impact
In rough order of how often they matter:
- Size the page cache to cover the working set. This dominates everything. A 32 GB page cache on a 30 GB working set will sustain a 95-percent-plus hit rate and millisecond query latency; a 4 GB cache on the same data will sustain a 30 percent hit rate and second-scale tail latency. Configure with dbms.memory.pagecache.size=32g; monitor with dbms.page_cache.hit_ratio.
- Add schema indexes for MATCH starting points. Every Cypher pattern starts at one or more anchor nodes — the bound nodes the engine begins traversal from. Without an index, finding those anchors means scanning the label store. With an index, it's a B-tree seek. CREATE INDEX user_account_idx FOR (u:User) ON (u.account_id) is usually the first DDL you write. Multi-property and composite indexes exist for compound predicates. Full-text indexes (Lucene-backed, created via CREATE FULLTEXT INDEX) handle CONTAINS and tokenised text search.
- Use PROFILE to spot Cartesian products and bad plans. A Cartesian product in a MATCH clause is almost always a bug — two patterns with no connecting relationship. PROFILE shows it as a CartesianProduct operator with a row count that explodes. Also watch for NodeByLabelScan where you expected NodeIndexSeek — that means the planner couldn't find a usable index, which usually means you forgot to create one.
- Use index hints when the planner picks wrong. Cost-based planners are good but not perfect. When statistics are stale or the data distribution is skewed, the planner sometimes picks a label scan over an index seek, or picks the wrong index for a multi-indexed property. USING INDEX u:User(account_id) forces the planner's hand. Use sparingly — every hint is a maintenance burden if the data shape changes — but they're the right tool when you've PROFILEd a query and know the planner is wrong.
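Pulled together, the memory half of the first lever lands in neo4j.conf as three lines (the sizes below are the worked example's, for a 64 GB host — not universal defaults):

```ini
# neo4j.conf — memory split for a 64 GB host (sizes are illustrative)
dbms.memory.heap.initial_size=12g
dbms.memory.heap.max_size=12g       # equal to initial_size so the heap never resizes
dbms.memory.pagecache.size=32g      # off-heap; sized to cover the hot working set
```

Heap plus page cache is 44 GB, just under the 70-percent-of-RAM ceiling, leaving 20 GB for the OS and file cache.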
A fifth lever sits below all of these in the operations manual but rarely needs touching: transaction log retention, checkpoint frequency, and parallel-runtime settings for large analytical queries. They matter for write-heavy or analytical workloads but not for the common read-heavy fraud/recommendation case.
Tuning Neo4j for fraud detection at an Indian fintech
A fintech operating UPI and credit-card payments runs Neo4j as the traversal layer of its fraud-detection stack. The graph has 50 million :User nodes (one per KYC'd account), 20 million :Merchant nodes, and 8 billion :PAID and :REFERRED relationships covering the last 18 months of activity. Total on-disk size: about 280 GB. The fraud team has a Cypher query they run on every flagged transaction:
MATCH path = (u:User {account_id: $aid})-[:PAID*1..3]->(other:User)
WHERE other.kyc_flag = 'suspicious'
RETURN DISTINCT other, length(path) AS hops
ORDER BY hops
The hot working set — the subset of users and edges actually touched in the last seven days of fraud queries — comes out to about 30 GB after measurement (a few percent of the total graph, weighted toward recent activity).
Day zero: default config. The engineer who deployed Neo4j took the defaults: dbms.memory.heap.max_size=4g, dbms.memory.pagecache.size=4g. The query takes 5 seconds end to end. Most of that time is page faults — the page cache holds 4 GB of the 30 GB working set, hit rate is sitting at 28 percent, and each missed read costs 7 ms on the SSD. PROFILE confirms it: the Expand(All) operator alone shows 4.2 million database hits, and the page-cache metrics report roughly 3 million page faults over the same run.
Step 1: size the page cache. The host has 64 GB of RAM. The engineer sets dbms.memory.heap.max_size=12g and dbms.memory.pagecache.size=32g, restarts, and lets the cache warm up over the next ten minutes by running the most common queries. After warmup, hit rate climbs to 96 percent. The same query now takes 800 ms — a 6× win, purely from getting the working set into RAM.
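Checking that split against the 70 percent rule from the memory-configuration section:

```python
ram_gb = 64
heap_gb = 12
pagecache_gb = 32

budget = heap_gb + pagecache_gb
assert budget / ram_gb < 0.70   # 44/64 ~ 0.69, just under the ceiling
assert pagecache_gb > 30        # covers the measured ~30 GB hot working set
```

The page cache is deliberately a little larger than the measured working set, leaving headroom for the working set to grow with transaction volume.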
Step 2: add a schema index. PROFILE on the now-warm query reveals a NodeByLabelScan(User) operator finding the anchor u. The MATCH starts at the user with a specific account_id, but with no index, Neo4j scans every :User node looking for the match. Adding CREATE INDEX user_aid FOR (u:User) ON (u.account_id) and re-running PROFILE shows the operator has changed to NodeIndexSeek(User, account_id), and the row count for that operator drops from 50 million to 1. Query time falls to 150 ms — another 5× win.
Step 3: spot the Cartesian. A second, related query the team runs is "find all merchants that received money from this user's 2-hop fraud cluster":
MATCH (u:User {account_id: $aid})-[:PAID*1..2]->(other:User), (m:Merchant)
WHERE (other)-[:PAID]->(m)
RETURN DISTINCT m
The two MATCH patterns are not connected — m is bound separately from u and other. PROFILE shows a CartesianProduct operator generating 14 trillion rows, then filtering. The query never actually returns; it gets killed by the transaction timeout. The fix is to fold both into a single connected pattern:
MATCH (u:User {account_id: $aid})-[:PAID*1..2]->(other:User)-[:PAID]->(m:Merchant)
RETURN DISTINCT m
Now the planner generates a clean traversal plan, no Cartesian. The query runs in 80 ms.
Step 4: warm cache and measure tail latency. With the page cache sized, the index in place, and the Cartesian removed, the original three-hop query lands at 50 ms at the median and 180 ms at p99. The team wires it into the live UPI-approval path: every transaction over ₹50,000 triggers the three-hop fraud expansion before approval, in real time. The 100× improvement (5 s → 50 ms) comes entirely from operations work — no change to the algorithm, no change to the data model, no change to the application code. Just sizing memory correctly, adding the right index, and removing one accidental Cartesian.
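The step-wise wins compound multiplicatively (latencies from the steps above; the per-step attributions are approximate):

```python
latencies_ms = [5000, 800, 150, 50]  # baseline, +page cache, +index, +plan cleanup

speedups = [a / b for a, b in zip(latencies_ms, latencies_ms[1:])]
# Cache sizing ~6x, index ~5x, remaining warm-up and plan cleanup ~3x ...
assert [round(s, 2) for s in speedups] == [6.25, 5.33, 3.0]

# ... which multiply to the overall 100x.
assert latencies_ms[0] / latencies_ms[-1] == 100
```

No single lever delivered 100× on its own; each removed a different bottleneck, and the gains stacked.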
The team's monitoring now tracks three numbers: dbms.page_cache.hit_ratio (alert below 92 percent), p99 query latency by query template (alert above 200 ms), and the count of CartesianProduct operators across all queries (alert if any new one appears, since it's usually a regression introduced by a developer who didn't connect their MATCH patterns).
Real-world deployments
Three production stacks worth knowing.
Fraud detection at HSBC. HSBC's anti-money-laundering platform uses Neo4j to model the transaction graph across the bank's retail and corporate businesses, executing multi-hop ring-detection queries similar to the worked example above. Public talks describe it as the "central nervous system" of their AML programme. The page cache is sized to cover the rolling 90-day transaction window; older data is offloaded to Hadoop for batch analytics.
Recommendations at Walmart. Walmart's online recommendation engine uses Neo4j to model the products-bought-with-products graph from years of order history. The traversal is two-hop: starting from the items in your current cart, find products other customers bought alongside the same items, weighted by recency. The page cache holds the entire active product catalogue and the most recent year of orders.
Knowledge graphs at NASA. NASA's "lessons learned" knowledge graph uses Neo4j to link engineers, projects, components, and incident reports across decades of mission data. Traversals like "find all engineers who worked on the same component class as the one that failed in incident X" let mission planners surface relevant prior experience. The graph is small by web-scale standards (a few hundred million nodes) but the relationships are dense and the queries are deep, which is exactly the workload index-free adjacency was designed for.
The takeaway
Neo4j's performance story has two layers. The storage layout — fixed-size record files with embedded adjacency pointers — makes the *ceiling* high: every traversal is theoretically a sub-microsecond pointer chase. The page cache decides how close you get to that ceiling by determining whether each pointer dereference hits a page in RAM or faults to SSD. Get the cache size right, add indexes for the anchor nodes, and use PROFILE to weed out Cartesians, and the engine will return three-hop fraud queries in tens of milliseconds against datasets that scale to billions of edges. Skip those steps and you'll see the same queries take seconds against the same data, on the same hardware, running the same engine. The difference is configuration, not algorithm.
This closes Build 20. You started this build with the data model — what a property graph is — and ended with the operational details of running the most widely deployed engine that implements it. The path was data model → storage layout → traversal primitives → query languages → relational comparison → engine internals. Build 21 goes to time-series databases, where the design pressures are completely different (sequential writes, time-bucketed reads, retention policies) and the engineering trade-offs flip the storage layout in ways that mirror Neo4j only in being purpose-built for the workload they serve.
References
- Neo4j Operations Manual, Database internals — neo4j.com/docs/operations-manual/current/database-internals/. Definitive on-disk record layout and storage engine reference.
- Neo4j Operations Manual, Memory configuration — neo4j.com/docs/operations-manual/current/performance/memory-configuration/. The official tuning guide for heap and page cache sizing.
- Neo4j Cypher Manual, Query tuning and execution plans — neo4j.com/docs/cypher-manual/current/planning-and-tuning/. EXPLAIN, PROFILE, index hints, and the cost-based planner.
- Partner, Vukotic, Watt, Neo4j in Action (Manning, 2014) — manning.com/books/neo4j-in-action. The canonical book on operating and querying Neo4j in production.
- Robinson, Webber, Eifrem, Graph Databases (2nd ed., O'Reilly, 2015), chapters 6–7 — [graphdatabases.com](https://graphdatabases.com/). Storage internals and query patterns from Neo4j's founders.
- Neo4j Knowledge Base, Page cache hit ratio and what it means — support.neo4j.com/. Practical interpretation of the hit-ratio metric and warmup strategies.