Memory · architecture

A million tokens isn’t memory

by Juergen Paulhart · 2026-04-23 · ~8 min read

The frontier-model context window climbed from 4K to 32K to 128K to 1M to 2M tokens in about two years. Each bump gets celebrated as though it’s the end state of “AI memory”. It isn’t. A million tokens is about 700 pages of text — a single shelf of a single bookcase. A working research lab has forty years of papers, lab notebooks, failed experiments, machine logs, sensor archives, supplier data sheets, PhD theses, and a very long Slack history. Orders of magnitude more.

The race for longer context is the wrong race for science. The right target is memory — a persistent, compressible, recall-able substrate separate from the token window — and the right machinery is already well-studied: vector retrieval for exact lookup, and xLSTM-style matrix memory for the compressed, associative kind. Eldric ships both. This post is about the why and the how.

[Architecture diagram: the scientific archive (40 years of papers, lab notebooks, failed experiments, sensor archives, institutional lore: petabytes in total) feeds the Eldric memory layer on the data module, which pairs a vector store for exact retrieval ("show me the paper") with an xLSTM-inspired Matrix Memory for compressed associative recall ("what happened around compound X?"), consolidated during idle time by the dream cycle (ingest → extract → probe → distill → checkpoint → complete). Alongside it, the 1M–2M-token LLM context window (~700 pages, ephemeral, dropped at the end of the turn) is the brute-force answer: fine for a single session, useless as durable memory. A scientist asking "tell me about compound XYZ-2001 failures" gets an answer with citations across 25 years.]

Context isn’t memory

The frontier labs have done extraordinary work on long context. The property they optimise for is that every token attends to every other token for the duration of one request. That’s genuinely useful — pasting a whole codebase for a refactor, or a whole contract for review. It’s also, categorically, not the same operation as recalling something that was never in the window: a notebook entry from 2003, or a pattern smeared across a decade of runs.

A long context window is a scratchpad. A memory layer is a cortex. Both useful. Not interchangeable.

The scale problem in one back-of-envelope

Archive                                Rough size               vs 1M tokens
1M-token context window                ~700 pages               baseline
A single PhD thesis                    ~300 pages               0.4×
A lab’s papers over 40 years           50–200 GB of PDFs        ~100×
Institutional email + chat history     100 GB+ indexed text     ~150×
Time-series sensor archive (decadal)   TB scale                 >1000×
CERN LHC ATLAS + CMS raw data          exabytes                 off-chart

Nobody’s building a 1-exabyte context. The question is what does go in front of the LLM at inference time, and how the archive behind it gets compressed into that window without losing the relevant patterns.

Two kinds of remembering

A scientist asking a question wants both kinds of recall at once: the exact kind (“show me the paper”) and the associative kind (“what happened around compound X?”).

Vector stores (embed + ANN search) are the mature answer to the first. They scale to billions of documents, they preserve exact provenance, and they’re boring in the good sense. Eldric ships one in the data module; that’s not the interesting part of this post.

The second kind — compressed associative recall — is where the architecture gets interesting, and where the recent xLSTM family of work becomes directly applicable as a storage primitive.

Matrix Memory — the idea in one equation

A matrix memory is a single dense matrix M of shape d × d that absorbs (key, value) pairs via outer products. Writing is:

M  ←  α·M  +  β·(v ⊗ k)

Reading is a single matrix-vector product:

output  =  M · q

That’s the whole loop. α is a decay gate (what to forget), β is a write gate (how important this write is). The matrix is a smeared-out record of every (key, value) pair that came through it, with the old stuff fading exponentially and the new stuff writing bright. A query q that resembles a past key triggers a recall that resembles the value written with that key — associative recall, the same kind a person has when a smell drags back a decade-old memory.

The storage cost doesn’t depend on how many things you’ve written. It’s O(d²), full stop. Recall is one matrix-vector multiply. You can run thousands of these per second on a laptop.
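To make the loop concrete, here is a minimal Python/numpy sketch of a matrix memory. It is illustrative, not the shipped C++ implementation; the class name, the fixed gate values, and the dimension are assumptions for the example:

    import numpy as np

    class MatrixMemory:
        """Toy associative memory: one d × d matrix, O(d²) storage regardless of write count."""
        def __init__(self, d, alpha=0.99, beta=1.0):
            self.M = np.zeros((d, d))
            self.alpha = alpha   # decay gate: what to forget
            self.beta = beta     # write gate: how important this write is

        def write(self, key, value):
            # M ← α·M + β·(v ⊗ k)
            self.M = self.alpha * self.M + self.beta * np.outer(value, key)

        def recall(self, query):
            # one matrix-vector product, constant time in the number of past writes
            return self.M @ query

    d = 768                                     # embedding dimension (illustrative)
    mem = MatrixMemory(d)
    k = np.random.randn(d); k /= np.linalg.norm(k)
    v = np.random.randn(d)
    mem.write(k, v)
    # a query resembling the old key recalls something resembling the old value
    print(np.dot(mem.recall(k), v) / np.linalg.norm(v) ** 2)   # ≈ 1.0

Write a few thousand more (key, value) pairs and the matrix is still the same d × d block of floats; only its contents change.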

The xLSTM connection

The outer-product write / matrix-vector recall pair is exactly the mLSTM cell from Sepp Hochreiter’s group’s xLSTM paper (Beck et al., 2024). xLSTM is an extended LSTM architecture: two cell variants — sLSTM (scalar memory) and mLSTM (matrix memory) — that together recover most of what made classical LSTMs attractive while adding the expressivity that Transformers took from them.

For a Vienna-adjacent AI project like Eldric, this paper is hard to ignore. xLSTM came out of JKU Linz; the underlying insight — that associative matrix memory is a powerful alternative to attention for sequence models — is a short bus ride from here, quite literally.

Eldric takes the mLSTM cell, freezes it as a storage primitive rather than a trained layer, and gives it an on-disk file format. A neural-network building block becomes a database.

v3 mLSTM → v4 Gated DeltaNet

Running mLSTM-style memory as durable storage surfaces a problem the paper version barely notices: a fresh outer product blindly adds to whatever was already written. Two contradictory (key, value) writes both land. Memory utilisation stalls around 60% before saturation noise washes everything out.

Matrix Memory v4 replaces the straight outer-product rule with a Gated DeltaNet error-correcting update:

M  ←  α·M  +  β·(v − M·k) ⊗ k

The write term is now only the delta between the incoming value and what the memory already returns for that key. If the memory already knows this fact, v − M·k is near zero and the write costs almost nothing. If the fact is new, the delta is large and the memory corrects toward it. Utilisation climbs past 90%.
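In the sketch from earlier, swapping in the v4-style delta rule is a one-method change. Again this is illustrative Python, not the cpp/src implementation:

    import numpy as np

    class DeltaMatrixMemory(MatrixMemory):        # extends the MatrixMemory sketch above
        def write(self, key, value):
            # M ← α·M + β·(v − M·k) ⊗ k
            delta = value - self.M @ key          # only what the memory gets wrong for this key
            self.M = self.alpha * self.M + self.beta * np.outer(delta, key)

    # two contradictory writes to the same key no longer pile up:
    mem = DeltaMatrixMemory(768, alpha=1.0)
    k = np.random.randn(768); k /= np.linalg.norm(k)
    v_old, v_new = np.random.randn(768), np.random.randn(768)
    mem.write(k, v_old)
    mem.write(k, v_new)                           # corrects toward v_new instead of adding on top
    print(np.allclose(mem.recall(k), v_new))      # True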

v4 is the default in alpha.3’s .emm file format; v3 files auto-upgrade on load. The whole thing lives in cpp/src/distributed/data/matrix_memory.cpp (~900 lines), behind the /api/v1/memory/* surface on the data module.

Hierarchical sizing — domains get their own matrix

One matrix isn’t enough. A chat log and a particle physics experiment have different memory needs; an 80×80 matrix that works for the chat suffocates on the experiment. Matrix Memory is hierarchical — Domain → Project → Run — with per-domain defaults:

Domain              Initial rank   Dimension   Use
chat                64             768         conversation memory
code                128            768         code patterns
particle_physics    512            1024        LHC experiment data
genomics            256            1024        DNA / protein sequences
seismic             256            768         earthquake patterns
robotics            128            512         motor control patterns
general             64             768         default

Ranks auto-expand when saturation crosses a threshold (0.85 by default). The matrices live in the .emm v4 format: 128-byte header with magic + dim + rank + CRC + SHA-256, 64 KB blocks each with their own CRC32, a write-ahead journal for crash safety, periodic checkpoints for disaster recovery.
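A sketch of how the sizing table and the 0.85 threshold could fit together. The dictionary mirrors the defaults above; the saturation proxy and the doubling step are illustrative assumptions, not the shipped expansion logic:

    import numpy as np

    # per-domain defaults, mirroring the table above: (initial rank, dimension)
    DOMAIN_DEFAULTS = {
        "chat":             (64,  768),
        "code":             (128, 768),
        "particle_physics": (512, 1024),
        "genomics":         (256, 1024),
        "seismic":          (256, 768),
        "robotics":         (128, 512),
        "general":          (64,  768),
    }

    SATURATION_THRESHOLD = 0.85   # default trigger from the text

    def open_domain_memory(domain):
        rank, dim = DOMAIN_DEFAULTS.get(domain, DOMAIN_DEFAULTS["general"])
        return np.zeros((dim, dim)), rank

    def saturation(M, rank_budget):
        # illustrative proxy: fraction of the allotted rank the matrix actually uses
        return np.linalg.matrix_rank(M, tol=1e-6) / rank_budget

    def maybe_expand(M, rank_budget):
        # grow the budget before saturation noise starts washing out recalls
        if saturation(M, rank_budget) > SATURATION_THRESHOLD:
            rank_budget *= 2
        return rank_budget

    M, rank = open_domain_memory("particle_physics")
    rank = maybe_expand(M, rank)                  # stays at 512 while the matrix is empty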

Consolidation — the dream cycle

The third piece is the one that looks most like actual brain machinery. The dream module is a background scheduler that walks a six-phase cycle during idle periods:

ingest → extract → probe → distill → checkpoint → complete

Scoped per user (opt-in via the sharing.dream plugin), a cycle pulls new documents and sensor windows, extracts (key, value) pairs, probes the existing memory to see what would surprise it, distills the delta into a new write, checkpoints the matrix, and marks the cycle complete. It’s the scientific equivalent of sleep — an offline pass that reorganises the day’s input into a form the next day’s query can recall in a single matrix-vector product.
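A sketch of one such pass, assuming the DeltaMatrixMemory object from the sketches above, a hypothetical vector_store.add() hook, and (key, value, provenance) triples already extracted upstream; the phase names are the real ones, everything else is illustrative:

    import numpy as np

    def surprise(recalled, value):
        # high when the memory's current recall for this key differs from the incoming value
        return np.linalg.norm(value - recalled) / (np.linalg.norm(value) + 1e-9)

    def dream_cycle(memory, new_pairs, vector_store, is_idle, threshold=0.5):
        """One consolidation pass: ingest → extract → probe → distill → checkpoint → complete."""
        for key, value, provenance in new_pairs:                   # ingest + extract happened upstream
            if not is_idle():
                return False                                       # yield to foreground work, resume later
            if surprise(memory.recall(key), value) > threshold:    # probe: would this write surprise it?
                memory.write(key, value)                           # distill: the delta rule writes only the new part
                vector_store.add(key, provenance)                  # keep the exact reference for citations
        np.save("chemistry_memory.npy", memory.M)                  # checkpoint (stand-in for the real .emm write)
        return True                                                # cycle complete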

You don’t notice the dream cycle at query time. What you notice is that next week’s “what was going on with ferrocyanide photolysis back then?” returns a coherent summary in milliseconds instead of a cold RAG search with no memory of the three hours you spent on it in May.

Worked example — a 25-year-old failed synthesis

A process chemist asks in chat:

“Did we ever try XYZ-2001 under phase-transfer conditions? I vaguely remember someone had trouble with it in the early 2000s.”

Under the hood (steps 1–3 are sketched in code after the list):

  1. The query embeds and hits the vector store across the chemistry domain namespace — pulls exact documents with XYZ-2001 and phase transfer in close proximity.
  2. Simultaneously the data module runs M·q on the lab’s chemistry Matrix Memory. The matrix has been absorbing every notebook entry for 25 years. It returns a dense vector that decodes into associated keys: “tetraethylammonium bromide”, “yield collapse near pH 6”, “2003 fail log, B. Heinz”.
  3. The vector-side snippets and the matrix-side associations merge into one retrieval-context block.
  4. The LLM answers with the documents and the pattern — citations to the actual 2003 lab-book page, plus the smeared-out memory that “pH 6 killed the yield” is a lesson the lab learned two decades ago.
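A sketch of steps 1–3, assuming a hypothetical embed() function, a vector_store with a search() method, the matrix-memory object from earlier, and a list of (label, key-vector) pairs for decoding the recalled vector; the real merge happens on the data module, not in Python:

    import numpy as np

    def retrieval_context(question, embed, vector_store, memory, key_labels, top_k=5):
        q = embed(question)                                   # 1. embed the query
        documents = vector_store.search(q, top_k=top_k)       #    ANN search: exact documents + citations
        recalled = memory.recall(q)                           # 2. associative side: one M·q
        # decode the recalled vector against known keys ("tetraethylammonium bromide", ...)
        scored = sorted(key_labels, key=lambda kl: -float(np.dot(recalled, kl[1])))
        associations = [label for label, _ in scored[:top_k]]
        return {"documents": documents,                       # 3. one merged retrieval-context block
                "associations": associations}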

A 1M-token context wouldn’t have done this. It could’ve read the 2003 lab book in full if you had it pre-selected — but selecting the right 2003 lab book out of 25 years of them is the actual problem, and that’s what the memory layer solves.

What ships in alpha.3

Concretely, this release includes Matrix Memory v4 with the Gated DeltaNet write rule as the default, the .emm v4 on-disk format (v3 files auto-upgrade on load), hierarchical Domain → Project → Run sizing with per-domain defaults and saturation-triggered expansion, the dream-cycle scheduler behind the sharing.dream plugin, and the vector store in the data module, with Matrix Memory exposed on the /api/v1/memory/* surface.

What this means for a lab planning AI infrastructure

Three practical takeaways:

  1. A context window is per-session scratch space. Plan for a memory layer that persists outside it, or the archive never makes it in front of the model.
  2. Exact retrieval and associative recall are different operations. A vector store covers the first, a matrix memory the second; a lab archive needs both.
  3. Consolidation has to happen somewhere. An idle-time pass like the dream cycle is what turns raw ingest into something a single matrix-vector product can recall.

If you’re a lab, a research group, or an enterprise R&D org looking at 1M-token benchmarks and wondering whether that’s the AI-memory story you should build on: probably not. There’s a better primitive, it’s a short walk from Vienna, and it’s already shipped.

#ScientificAI #xLSTM #MatrixMemory #DeltaNet #Hochreiter #AssociativeMemory #RAG #Eldric