Memory · architecture
A million tokens isn’t memory
The frontier-model context window climbed from 4K to 32K to 128K to 1M to 2M tokens in about two years. Each bump gets celebrated as though it’s the end state of “AI memory”. It isn’t. A million tokens is about 700 pages of text — a single shelf of a single bookcase. A working research lab has forty years of papers, lab notebooks, failed experiments, machine logs, sensor archives, supplier data sheets, PhD theses, and a very long Slack history. Orders of magnitude more.
The race for longer context is the wrong race for science. The right target is memory — a persistent, compressible, recall-able substrate separate from the token window — and the right machinery is already well-studied: vector retrieval for exact lookup, and xLSTM-style matrix memory for the compressed, associative kind. Eldric ships both. This post is about the why and the how.
Context isn’t memory
The frontier labs have done extraordinary work on long context. The property they optimise for is that every token attends to every other token for the duration of one request. That’s genuinely useful — pasting a whole codebase for a refactor, or a whole contract for review. It’s also, categorically, not the same operation as:
- Recall across sessions. Next week you want the same fact. The context is gone.
- Compression. The archive doesn’t fit in 2M tokens. It won’t fit in 20M. The bigger you make the window, the more you pay per query for the same question.
- Associative recall. “What did we notice about yields near pH 6 in 2019?” doesn’t have a single canonical source; it’s a pattern scattered across a hundred notebooks.
- Per-user scoping. Context leaks. Multi-tenant deployments need memory that respects ownership boundaries even when two users ask overlapping questions.
A long context window is a scratchpad. A memory layer is a cortex. Both useful. Not interchangeable.
The scale problem in one back-of-envelope
| Archive | Rough size | vs 1M tokens |
|---|---|---|
| 1M-token context window | ~700 pages | baseline |
| A single PhD thesis | ~300 pages | 0.4× |
| A lab’s papers over 40 years | 50–200 GB of PDFs | ~100× |
| Institutional email + chat history | 100 GB+ indexed text | ~150× |
| Time-series sensor archive (decadal) | TB scale | >1000× |
| CERN LHC ATLAS + CMS raw data | exabytes | off-chart |
Nobody’s building a 1-exabyte context. The question is what does go in front of the LLM at inference time, and how the archive behind it gets compressed into that window without losing the relevant patterns.
Two kinds of remembering
A scientist asking a question wants both kinds of recall at once. Two examples:
- “Pull up Dr Berger’s 2009 Nature paper on ferrocyanide photolysis.” — exact lookup. The answer is a specific document; the right machinery is indexing + similarity search.
- “What did we learn across all the failed XYZ-2001 syntheses in the last decade?” — compressed, pattern-level recall. The answer isn’t any one document; it’s the trace a thousand experiments left in the institutional memory.
Vector stores (embed + ANN search) are the mature answer to the first. They scale to billions of documents, they preserve exact provenance, and they’re boring in the good sense. Eldric ships one in the data module; that’s not the interesting part of this post.
The second kind — compressed associative recall — is where the architecture gets interesting, and where the recent xLSTM family of work becomes directly applicable as a storage primitive.
Matrix Memory — the idea in one equation
A matrix memory is a single dense matrix M of shape d × d that absorbs (key, value) pairs via outer products. Writing is:

M ← α·M + β·(v ⊗ k)

Reading is a single matrix-vector product:

recall = M·q

That’s the whole loop. α is a decay gate (what to forget), β is a write gate (how important this write is). The matrix is a smeared-out record of every (key, value) pair that came through it, with the old stuff fading exponentially and the new stuff writing bright. A query q that resembles a past key triggers a recall that resembles the value written with that key — associative recall, the same kind a person has when a smell drags back a decade-old memory.

The storage cost doesn’t depend on how many things you’ve written. It’s O(d²), full stop. Recall is one matrix-vector multiply. You can run thousands of these per second on a laptop.
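To make the loop concrete, here is a minimal sketch in plain C++, using a flat d × d matrix. The struct and method names are illustrative only, not the actual interface in matrix_memory.cpp.

```cpp
#include <cstddef>
#include <vector>

// Minimal associative matrix memory: a flat d x d matrix that absorbs
// (key, value) pairs via outer products and recalls with one matrix-vector
// product. Names are illustrative, not Eldric's API.
struct MatrixMemory {
    std::size_t d;
    std::vector<double> M;  // row-major d x d, starts at zero

    explicit MatrixMemory(std::size_t dim) : d(dim), M(dim * dim, 0.0) {}

    // Write: M <- alpha*M + beta*(v outer k). alpha = decay gate, beta = write gate.
    void write(const std::vector<double>& k, const std::vector<double>& v,
               double alpha, double beta) {
        for (std::size_t i = 0; i < d; ++i)
            for (std::size_t j = 0; j < d; ++j)
                M[i * d + j] = alpha * M[i * d + j] + beta * v[i] * k[j];
    }

    // Read: recall = M*q. A query close to a stored key returns something
    // close to the value that was written with that key.
    std::vector<double> read(const std::vector<double>& q) const {
        std::vector<double> out(d, 0.0);
        for (std::size_t i = 0; i < d; ++i)
            for (std::size_t j = 0; j < d; ++j)
                out[i] += M[i * d + j] * q[j];
        return out;
    }
};
```

Write a (key, value) pair once, and read(q) for any q close to k returns roughly v (assuming roughly unit-norm keys); storage stays at d² numbers no matter how many pairs have passed through.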
The xLSTM connection
The outer-product write / matrix-vector recall pair is exactly the mLSTM cell from Sepp Hochreiter’s group’s xLSTM paper (Beck et al., 2024). xLSTM is an extended LSTM architecture: two cell variants — sLSTM (scalar memory) and mLSTM (matrix memory) — that together recover most of what made classical LSTMs attractive while adding the expressivity that Transformers took from them.
For a Vienna-adjacent AI project like Eldric, this paper is hard to ignore. xLSTM came out of JKU Linz; the underlying insight — that associative matrix memory is a powerful alternative to attention for sequence models — is a short bus ride from here, quite literally.
Eldric takes the mLSTM cell, freezes it as a storage primitive rather than a trained layer, and gives it an on-disk file format. A neural-network building block becomes a database.
v3 mLSTM → v4 Gated DeltaNet
Running mLSTM-style memory as durable storage surfaces a problem the paper version barely notices: a fresh outer product blindly adds to whatever was already written. Two contradictory (key, value) writes both land. Memory utilisation stalls around 60% before saturation noise washes everything out.
Matrix Memory v4 replaces the straight outer-product rule with a Gated DeltaNet error-correcting update:

M ← α·M + β·((v − M·k) ⊗ k)

The write term is now only the delta between the incoming value and what the memory already returns for that key. If the memory already knows this fact, v − M·k is near zero and the write costs almost nothing. If the fact is new, the delta is large and the memory corrects toward it. Utilisation climbs past 90%.
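A sketch of that update, reusing the illustrative MatrixMemory from the earlier snippet. Where exactly the decay gate enters relative to the delta varies between formulations; this version follows the description in the paragraph above, and the names are again not Eldric's.

```cpp
#include <cstddef>
#include <vector>

// Gated DeltaNet-style write, as described above: only the error between the
// incoming value and what the memory already returns for this key is written.
//   M <- alpha*M + beta*((v - M*k) outer k)
void write_delta(MatrixMemory& mem,
                 const std::vector<double>& k, const std::vector<double>& v,
                 double alpha, double beta) {
    const std::vector<double> predicted = mem.read(k);  // what the memory says now
    for (std::size_t i = 0; i < mem.d; ++i) {
        const double delta = v[i] - predicted[i];       // near zero if already known
        for (std::size_t j = 0; j < mem.d; ++j)
            mem.M[i * mem.d + j] = alpha * mem.M[i * mem.d + j] + beta * delta * k[j];
    }
}
```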
v4 is the default in alpha.3’s .emm file format; v3 files auto-upgrade on load. The whole thing lives in cpp/src/distributed/data/matrix_memory.cpp (~900 lines), behind the /api/v1/memory/* surface on the data module.
Hierarchical sizing — domains get their own matrix
One matrix isn’t enough. A chat log and a particle physics experiment have different memory needs; an 80×80 matrix that works for the chat suffocates on the experiment. Matrix Memory is hierarchical — Domain → Project → Run — with per-domain defaults:
| Domain | Initial rank | Dimension | Use |
|---|---|---|---|
| chat | 64 | 768 | conversation memory |
| code | 128 | 768 | code patterns |
| particle_physics | 512 | 1024 | LHC experiment data |
| genomics | 256 | 1024 | DNA / protein sequences |
| seismic | 256 | 768 | earthquake patterns |
| robotics | 128 | 512 | motor control patterns |
| general | 64 | 768 | default |
Ranks auto-expand when saturation crosses a threshold (0.85 by default). The matrices live in the .emm v4 format: 128-byte header with magic + dim + rank + CRC + SHA-256, 64 KB blocks each with their own CRC32, a write-ahead journal for crash safety, periodic checkpoints for disaster recovery.
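For orientation, here is one way that header could be laid out as a packed struct. This is an assumption-laden sketch: the field order, field names, and magic string are invented, and only the listed ingredients and the 128-byte total come from the format description above.

```cpp
#include <cstdint>

// Hypothetical layout for the 128-byte .emm v4 header.
#pragma pack(push, 1)
struct EmmHeaderSketch {
    char     magic[8];       // hypothetical magic string identifying the format
    uint32_t version;        // format version (4 for the current format)
    uint32_t dim;            // matrix dimension d
    uint32_t rank;           // current rank for this domain
    uint32_t block_size;     // 64 KB payload blocks, each carrying its own CRC32
    uint32_t header_crc32;   // CRC over the header itself
    uint8_t  sha256[32];     // integrity hash over the matrix payload
    uint8_t  reserved[68];   // padding up to the 128-byte header
};
#pragma pack(pop)
static_assert(sizeof(EmmHeaderSketch) == 128, "header is exactly 128 bytes");
```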
Consolidation — the dream cycle
The third piece is the one that looks most like actual brain machinery. The dream module is a background scheduler that walks a six-phase cycle during idle periods:

ingest → extract → probe → distill → checkpoint → complete
Per-user scope (opted in via the sharing.dream plugin): pull new documents and sensor windows, extract (key, value) pairs, probe the existing memory to see what would surprise it, distill the delta into a new write, checkpoint the matrix, and mark the cycle complete. It’s the scientific equivalent of sleep — an offline pass that reorganises the day’s input into a form the next-day query can recall in a single matrix-vector product.
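The probe and distill phases are the part of the cycle that leans on the Gated DeltaNet update: a fact the memory already recalls produces a small delta and can be skipped, a surprising one gets written. A sketch of that step, reusing the illustrative MatrixMemory and write_delta from earlier; the surprise measure, threshold, and gate values are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Probe the memory for (k, v); distill (write) only if the recall is surprising.
bool probe_and_distill(MatrixMemory& mem,
                       const std::vector<double>& k, const std::vector<double>& v,
                       double surprise_threshold = 0.05) {
    const std::vector<double> recalled = mem.read(k);      // probe
    double surprise = 0.0;
    for (std::size_t i = 0; i < mem.d; ++i) {
        const double diff = v[i] - recalled[i];
        surprise += diff * diff;
    }
    surprise = std::sqrt(surprise);
    if (surprise < surprise_threshold)
        return false;                                      // memory already knows this
    write_delta(mem, k, v, /*alpha=*/0.99, /*beta=*/1.0);  // distill only the delta
    return true;
}
```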
You don’t notice the dream cycle at query time. What you notice is that next week’s “what was going on with ferrocyanide photolysis back then?” returns a coherent summary in milliseconds instead of a cold RAG search with no memory of the three hours you spent on it in May.
Worked example — a 25-year-old failed synthesis
A process chemist asks in chat:
“Did we ever try XYZ-2001 under phase-transfer conditions? I vaguely remember someone had trouble with it in the early 2000s.”
Under the hood:
- The query embeds and hits the vector store across the chemistry domain namespace — pulls exact documents with XYZ-2001 and phase transfer in close proximity.
- Simultaneously, the data module runs M·q on the lab’s chemistry Matrix Memory. The matrix has been absorbing every notebook entry for 25 years. It returns a dense vector that decodes into associated keys: “tetraethylammonium bromide”, “yield collapse near pH 6”, “2003 fail log, B. Heinz”.
- The vector-side snippets and the matrix-side associations merge into one retrieval-context block.
- The LLM answers with the documents and the pattern — citations to the actual 2003 lab-book page, plus the smeared-out memory that “pH 6 killed the yield” is a lesson the lab learned a decade ago.
A 1M-token context wouldn’t have done this. It could’ve read the 2003 lab book in full if you had it pre-selected — but selecting the right 2003 lab book out of 25 years of them is the actual problem, and that’s what the memory layer solves.
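Structurally, that answer is two independent lookups merged into one block before the LLM sees anything. A skeleton of the fan-out, with stubbed stand-ins for both paths; none of the names below are Eldric's actual client API, and the stubs run sequentially here where the real fan-out call runs them in parallel.

```cpp
#include <string>
#include <vector>

struct RetrievalContext {
    std::vector<std::string> documents;     // exact snippets with provenance (vector-store side)
    std::vector<std::string> associations;  // decoded neighbours of M·q (matrix-memory side)
};

// Stub stand-ins for the two retrieval paths.
std::vector<std::string> vector_store_search(const std::string& ns, const std::string& query) {
    return {"[" + ns + "] 2003 lab book, B. Heinz: phase-transfer run of " + query};
}
std::vector<std::string> matrix_memory_recall(const std::string& /*ns*/, const std::string& /*query*/) {
    return {"tetraethylammonium bromide", "yield collapse near pH 6"};
}

RetrievalContext hybrid_retrieve(const std::string& query) {
    RetrievalContext ctx;
    ctx.documents    = vector_store_search("chemistry", query);   // exact lookup
    ctx.associations = matrix_memory_recall("chemistry", query);  // associative recall
    // Both land in the same retrieval-context block handed to the LLM.
    return ctx;
}
```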
What ships in alpha.3
- Matrix Memory v4 (Gated DeltaNet) with v3 auto-upgrade on load. .emm file format, WAL, checkpointing, per-domain sizing.
- Vector store with multi-tenant namespaces, embedding provider of your choice (Ollama, OpenAI-compat, local fallback).
- Hybrid search: retrieval.data.local queries both in parallel per fan-out call and merges the results — no opt-in required.
- Dream cycle wiring: ingest + checkpoint phases; probe and distill land on the Phase-3.5 follow-up.
- Memory API surface: /api/v1/memory/{health,matrices,checkpoint,verify,store,recall,forget}.
- Training Worker supports xLSTM distillation (Transformer → xLSTM) as a three-stage pipeline — the trained-model counterpart to the storage primitive above.
What this means for a lab planning AI infrastructure
Three practical takeaways:
- Don’t buy your AI strategy based on context length. 2M tokens is great for a one-shot refactor. It will not remember anything next week.
- Pair vector retrieval with matrix memory. Exact citations for trust and provenance; compressed associative recall for the institutional patterns that don’t live in any one document. Eldric ships both behind one retrieval call.
- Budget for consolidation. The difference between “we stored everything” and “we remember things” is idle-time compute reorganising the former into the latter. Dream cycles are cheap. Run them.
If you’re a lab, a research group, or an enterprise R&D org looking at 1M-token benchmarks and wondering whether that’s the AI-memory story you should build on: probably not. There’s a better primitive, it’s a short walk from Vienna, and it’s already shipped.