Memory · architecture
A million tokens isn’t memory
The frontier-model context window climbed from 4K to 32K to 128K to 1M to 2M tokens in about two years. Each bump gets celebrated as though it’s the end state of “AI memory”. It isn’t. A million tokens is about 700 pages of text — a single shelf of a single bookcase. A working research lab has forty years of papers, lab notebooks, failed experiments, machine logs, sensor archives, supplier data sheets, PhD theses, and a very long Slack history. Orders of magnitude more.
The race for longer context is the wrong race for science. The right target is memory — a persistent, compressible, recall-able substrate separate from the token window — and the right machinery is already well-studied: vector retrieval for exact lookup, and xLSTM-style matrix memory for the compressed, associative kind. Eldric ships both. This post is about the why and the how.
Context isn’t memory
The frontier labs have done extraordinary work on long context. The property they optimise for is that every token attends to every other token for the duration of one request. That’s genuinely useful — pasting a whole codebase for a refactor, or a whole contract for review. It’s also, categorically, not the same operation as:
- Recall across sessions. Next week you want the same fact. The context is gone.
- Compression. The archive doesn’t fit in 2M tokens. It won’t fit in 20M. The bigger you make the window, the more you pay per query for the same question.
- Associative recall. “What did we notice about yields near pH 6 in 2019?” doesn’t have a single canonical source; it’s a pattern scattered across a hundred notebooks.
- Per-user scoping. Context leaks. Multi-tenant deployments need memory that respects ownership boundaries even when two users ask overlapping questions.
A long context window is a scratchpad. A memory layer is a cortex. Both useful. Not interchangeable.
The scale problem in one back-of-envelope
| Archive | Rough size | vs 1M tokens |
|---|---|---|
| 1M-token context window | ~700 pages | baseline |
| A single PhD thesis | ~300 pages | 0.4× |
| A lab’s papers over 40 years | 50–200 GB of PDFs | ~100× |
| Institutional email + chat history | 100 GB+ indexed text | ~150× |
| Time-series sensor archive (decadal) | TB scale | >1000× |
| CERN LHC ATLAS + CMS raw data | exabytes | off-chart |
Nobody’s building a 1-exabyte context. The question is what does go in front of the LLM at inference time, and how the archive behind it gets compressed into that window without losing the relevant patterns.
Two kinds of remembering
A scientist asking a question wants both kinds of recall at once. Two examples:
- “Pull up Dr Berger’s 2009 Nature paper on ferrocyanide photolysis.” — exact lookup. The answer is a specific document; the right machinery is indexing + similarity search.
- “What did we learn across all the failed XYZ-2001 syntheses in the last decade?” — compressed, pattern-level recall. The answer isn’t any one document; it’s the trace a thousand experiments left in the institutional memory.
Vector stores (embed + ANN search) are the mature answer to the first. They scale to billions of documents, they preserve exact provenance, and they’re boring in the good sense. Eldric ships one in the data module; that’s not the interesting part of this post.
The second kind — compressed associative recall — is where the architecture gets interesting, and where the recent xLSTM family of work becomes directly applicable as a storage primitive.
Matrix Memory — the idea in one equation
A matrix memory is a single dense matrix M of shape d × d that absorbs (key, value) pairs via outer products. Writing is:

M ← α·M + β·(v ⊗ k)

Reading is a single matrix-vector product:

recall = M·q

That’s the whole loop. α is a decay gate (what to forget), β is a write gate (how important this write is). The matrix is a smeared-out record of every (key, value) pair that came through it, with the old stuff fading exponentially and the new stuff writing bright. A query q that resembles a past key triggers a recall that resembles the value written with that key — associative recall, the same kind a person has when a smell drags back a decade-old memory.

The storage cost doesn’t depend on how many things you’ve written. It’s O(d²), full stop. Recall is one matrix-vector multiply. You can run thousands of these per second on a laptop.
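To make the loop concrete, here is a minimal sketch in plain C++, using a flat d × d matrix. The struct and method names are illustrative only, not the actual interface in matrix_memory.cpp.

```cpp
#include <cstddef>
#include <vector>

// Minimal associative matrix memory: a flat d x d matrix that absorbs
// (key, value) pairs via outer products and recalls with one matrix-vector
// product. Names are illustrative, not Eldric's API.
struct MatrixMemory {
    std::size_t d;
    std::vector<double> M;  // row-major d x d, starts at zero

    explicit MatrixMemory(std::size_t dim) : d(dim), M(dim * dim, 0.0) {}

    // Write: M <- alpha*M + beta*(v outer k). alpha = decay gate, beta = write gate.
    void write(const std::vector<double>& k, const std::vector<double>& v,
               double alpha, double beta) {
        for (std::size_t i = 0; i < d; ++i)
            for (std::size_t j = 0; j < d; ++j)
                M[i * d + j] = alpha * M[i * d + j] + beta * v[i] * k[j];
    }

    // Read: recall = M*q. A query close to a stored key returns something
    // close to the value that was written with that key.
    std::vector<double> read(const std::vector<double>& q) const {
        std::vector<double> out(d, 0.0);
        for (std::size_t i = 0; i < d; ++i)
            for (std::size_t j = 0; j < d; ++j)
                out[i] += M[i * d + j] * q[j];
        return out;
    }
};
```

Write a (key, value) pair once, and read(q) for any q close to k returns roughly v (assuming roughly unit-norm keys); storage stays at d² numbers no matter how many pairs have passed through.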
The xLSTM connection
The outer-product write / matrix-vector recall pair is exactly the mLSTM cell from Sepp Hochreiter’s group’s xLSTM paper (Beck et al., 2024). xLSTM is an extended LSTM architecture: two cell variants — sLSTM (scalar memory) and mLSTM (matrix memory) — that together recover most of what made classical LSTMs attractive while adding the expressivity that Transformers took from them.
For a Vienna-adjacent AI project like Eldric, this paper is hard to ignore. xLSTM came out of JKU Linz; the underlying insight — that associative matrix memory is a powerful alternative to attention for sequence models — is a short bus ride from here, quite literally.
Eldric takes the mLSTM cell, freezes it as a storage primitive rather than a trained layer, and gives it an on-disk file format. A neural-network building block becomes a database.
v3 mLSTM → v4 Gated DeltaNet
Running mLSTM-style memory as durable storage surfaces a problem the paper version barely notices: a fresh outer product blindly adds to whatever was already written. Two contradictory (key, value) writes both land. Memory utilisation stalls around 60% before saturation noise washes everything out.
Matrix Memory v4 replaces the straight outer-product rule with a Gated DeltaNet error-correcting update:

M ← α·M + β·((v − M·k) ⊗ k)

The write term is now only the delta between the incoming value and what the memory already returns for that key. If the memory already knows this fact, v − M·k is near zero and the write costs almost nothing. If the fact is new, the delta is large and the memory corrects toward it. Utilisation climbs past 90%.
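A sketch of that update, reusing the illustrative MatrixMemory from the earlier snippet. Where exactly the decay gate enters relative to the delta varies between formulations; this version follows the description in the paragraph above, and the names are again not Eldric's.

```cpp
#include <cstddef>
#include <vector>

// Gated DeltaNet-style write, as described above: only the error between the
// incoming value and what the memory already returns for this key is written.
//   M <- alpha*M + beta*((v - M*k) outer k)
void write_delta(MatrixMemory& mem,
                 const std::vector<double>& k, const std::vector<double>& v,
                 double alpha, double beta) {
    const std::vector<double> predicted = mem.read(k);  // what the memory says now
    for (std::size_t i = 0; i < mem.d; ++i) {
        const double delta = v[i] - predicted[i];       // near zero if already known
        for (std::size_t j = 0; j < mem.d; ++j)
            mem.M[i * mem.d + j] = alpha * mem.M[i * mem.d + j] + beta * delta * k[j];
    }
}
```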
v4 is the default in alpha.3’s .emm file format; v3 files auto-upgrade on load. The whole thing lives in cpp/src/distributed/data/matrix_memory.cpp (~900 lines), behind the /api/v1/memory/* surface on the data module.
Hierarchical sizing — domains get their own matrix
One matrix isn’t enough. A chat log and a particle physics experiment have different memory needs; an 80×80 matrix that works for the chat suffocates on the experiment. Matrix Memory is hierarchical — Domain → Project → Run — with per-domain defaults:
| Domain | Initial rank | Dimension | Use |
|---|---|---|---|
| chat | 64 | 768 | conversation memory |
| code | 128 | 768 | code patterns |
| particle_physics | 512 | 1024 | LHC experiment data |
| genomics | 256 | 1024 | DNA / protein sequences |
| seismic | 256 | 768 | earthquake patterns |
| robotics | 128 | 512 | motor control patterns |
| general | 64 | 768 | default |
Ranks auto-expand when saturation crosses a threshold (0.85 by default). The matrices live in the .emm v4 format: 128-byte header with magic + dim + rank + CRC + SHA-256, 64 KB blocks each with their own CRC32, a write-ahead journal for crash safety, periodic checkpoints for disaster recovery.
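For orientation, here is one way that header could be laid out as a packed struct. This is an assumption-laden sketch: the field order, field names, and magic string are invented, and only the listed ingredients and the 128-byte total come from the format description above.

```cpp
#include <cstdint>

// Hypothetical layout for the 128-byte .emm v4 header.
#pragma pack(push, 1)
struct EmmHeaderSketch {
    char     magic[8];       // hypothetical magic string identifying the format
    uint32_t version;        // format version (4 for the current format)
    uint32_t dim;            // matrix dimension d
    uint32_t rank;           // current rank for this domain
    uint32_t block_size;     // 64 KB payload blocks, each carrying its own CRC32
    uint32_t header_crc32;   // CRC over the header itself
    uint8_t  sha256[32];     // integrity hash over the matrix payload
    uint8_t  reserved[68];   // padding up to the 128-byte header
};
#pragma pack(pop)
static_assert(sizeof(EmmHeaderSketch) == 128, "header is exactly 128 bytes");
```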
Consolidation — the dream cycle
The third piece is the one that looks most like actual brain machinery. The dream module is a background scheduler that walks a six-phase cycle during idle periods:

ingest → extract → probe → distill → checkpoint → complete
Per-user scope (opted in via the sharing.dream plugin): pull new documents and sensor windows, extract (key, value) pairs, probe the existing memory to see what would surprise it, distill the delta into a new write, checkpoint the matrix, and mark the cycle complete. It’s the scientific equivalent of sleep — an offline pass that reorganises the day’s input into a form the next-day query can recall in a single matrix-vector product.
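The probe and distill phases are the part of the cycle that leans on the Gated DeltaNet update: a fact the memory already recalls produces a small delta and can be skipped, a surprising one gets written. A sketch of that step, reusing the illustrative MatrixMemory and write_delta from earlier; the surprise measure, threshold, and gate values are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Probe the memory for (k, v); distill (write) only if the recall is surprising.
bool probe_and_distill(MatrixMemory& mem,
                       const std::vector<double>& k, const std::vector<double>& v,
                       double surprise_threshold = 0.05) {
    const std::vector<double> recalled = mem.read(k);      // probe
    double surprise = 0.0;
    for (std::size_t i = 0; i < mem.d; ++i) {
        const double diff = v[i] - recalled[i];
        surprise += diff * diff;
    }
    surprise = std::sqrt(surprise);
    if (surprise < surprise_threshold)
        return false;                                      // memory already knows this
    write_delta(mem, k, v, /*alpha=*/0.99, /*beta=*/1.0);  // distill only the delta
    return true;
}
```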
You don’t notice the dream cycle at query time. What you notice is that next week’s “what was going on with ferrocyanide photolysis back then?” returns a coherent summary in milliseconds instead of a cold RAG search with no memory of the three hours you spent on it in May.
Worked example — a 25-year-old failed synthesis
A process chemist asks in chat:
“Did we ever try XYZ-2001 under phase-transfer conditions? I vaguely remember someone had trouble with it in the early 2000s.”
Under the hood:
- The query embeds and hits the vector store across the chemistry domain namespace — pulls exact documents with XYZ-2001 and phase transfer in close proximity.
- Simultaneously, the data module runs M·q on the lab’s chemistry Matrix Memory. The matrix has been absorbing every notebook entry for 25 years. It returns a dense vector that decodes into associated keys: “tetraethylammonium bromide”, “yield collapse near pH 6”, “2003 fail log, B. Heinz”.
- The vector-side snippets and the matrix-side associations merge into one retrieval-context block.
- The LLM answers with the documents and the pattern — citations to the actual 2003 lab-book page, plus the smeared-out memory that “pH 6 killed the yield” is a lesson the lab learned a decade ago.
A 1M-token context wouldn’t have done this. It could’ve read the 2003 lab book in full if you had it pre-selected — but selecting the right 2003 lab book out of 25 years of them is the actual problem, and that’s what the memory layer solves.
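Structurally, that answer is two independent lookups merged into one block before the LLM sees anything. A skeleton of the fan-out, with stubbed stand-ins for both paths; none of the names below are Eldric's actual client API, and the stubs run sequentially here where the real fan-out call runs them in parallel.

```cpp
#include <string>
#include <vector>

struct RetrievalContext {
    std::vector<std::string> documents;     // exact snippets with provenance (vector-store side)
    std::vector<std::string> associations;  // decoded neighbours of M·q (matrix-memory side)
};

// Stub stand-ins for the two retrieval paths.
std::vector<std::string> vector_store_search(const std::string& ns, const std::string& query) {
    return {"[" + ns + "] 2003 lab book, B. Heinz: phase-transfer run of " + query};
}
std::vector<std::string> matrix_memory_recall(const std::string& /*ns*/, const std::string& /*query*/) {
    return {"tetraethylammonium bromide", "yield collapse near pH 6"};
}

RetrievalContext hybrid_retrieve(const std::string& query) {
    RetrievalContext ctx;
    ctx.documents    = vector_store_search("chemistry", query);   // exact lookup
    ctx.associations = matrix_memory_recall("chemistry", query);  // associative recall
    // Both land in the same retrieval-context block handed to the LLM.
    return ctx;
}
```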
What ships in alpha.3
- Matrix Memory v4 (Gated DeltaNet) with v3 auto-upgrade on load. .emm file format, WAL, checkpointing, per-domain sizing.
- Vector store with multi-tenant namespaces, embedding provider of your choice (Ollama, OpenAI-compat, local fallback).
- Hybrid search: retrieval.data.local queries both in parallel per fan-out call and merges the results — no opt-in required.
- Dream cycle wiring: ingest + checkpoint phases; probe and distill land on the Phase-3.5 follow-up.
- Memory API surface: /api/v1/memory/{health,matrices,checkpoint,verify,store,recall,forget}.
- Training Worker supports xLSTM distillation (Transformer → xLSTM) as a three-stage pipeline — the trained-model counterpart to the storage primitive above.
What this means for a lab planning AI infrastructure
Three practical takeaways:
- Don’t buy your AI strategy based on context length. 2M tokens is great for a one-shot refactor. It will not remember anything next week.
- Pair vector retrieval with matrix memory. Exact citations for trust and provenance; compressed associative recall for the institutional patterns that don’t live in any one document. Eldric ships both behind one retrieval call.
- Budget for consolidation. The difference between “we stored everything” and “we remember things” is idle-time compute reorganising the former into the latter. Dream cycles are cheap. Run them.
If you’re a lab, a research group, or an enterprise R&D org looking at 1M-token benchmarks and wondering whether that’s the AI-memory story you should build on: probably not. There’s a better primitive, it’s a short walk from Vienna, and it’s already shipped.