RAG on demand

Don't store everything.
Retrieve when needed.

Most RAG systems pre-warehouse documents and pay storage + re-embedding costs whether anyone asks about them or not. Eldric is different: the platform tries learned weights first, then associative memory, then your knowledge base, then live external sources — and only ingests the material the platform actually finds useful, when it finds it useful. The result is a knowledge layer that grows on signal, not on hoarding.

The cascade

Four tiers, smallest first.

On every query, the platform walks four tiers in order, stopping as soon as it has a confident answer:

ENRN learned weights — small neural classifier that has internalised your cluster's patterns. Sub-millisecond. If the query matches a well-known intent, the classifier answers directly without retrieval.
EMM associative memory — the compressed, generalising layer. Holds patterns the platform has learned over time. Microsecond-latency retrieval on CPU. Good for queries where the exact wording differs but the meaning is one the platform has seen.
Your knowledge base (RAG) — exact retrieval over your indexed documents. Returns the specific passages, with citations. Good for queries where the answer is "what does my document actually say".
Live external sources — when none of the above is sufficient, the platform queries configured external sources (Science Source Registry, web search, vendor APIs) live and synthesises a response with provenance. Only fires when needed.

Each tier carries a confidence signal; the platform only escalates to the next tier when the current one isn't confident enough. Saves cycles, saves money on paid external APIs, keeps the latency budget intact.
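A minimal sketch of that escalation rule, for readers who want to see the control flow. The `Tier`, `TierResult` and `run_cascade` names, the per-tier thresholds, and the stub lookups are all illustrative assumptions, not Eldric's actual API; the only thing taken from the description above is "walk the tiers in order, stop at the first confident answer".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class TierResult:
    answer: Optional[str]   # None when the tier has nothing to offer
    confidence: float       # assumed to be normalised to 0.0 .. 1.0


@dataclass
class Tier:
    name: str
    lookup: Callable[[str], TierResult]
    threshold: float        # hypothetical per-tier confidence cut-off


def run_cascade(query: str, tiers: Sequence[Tier]) -> Tuple[str, TierResult]:
    """Walk the tiers in order and stop at the first confident answer."""
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    for tier in tiers:
        result = tier.lookup(query)
        if result.answer is not None and result.confidence >= tier.threshold:
            return tier.name, result
    # No tier cleared its threshold: hand back the last tier's best effort.
    return tier.name, result


# Stub tiers standing in for ENRN, EMM, RAG and live sources (made-up values).
calls: List[str] = []


def stub(name: str, answer: Optional[str], confidence: float) -> Callable[[str], TierResult]:
    def lookup(query: str) -> TierResult:
        calls.append(name)  # record which tiers actually ran
        return TierResult(answer, confidence)
    return lookup


TIERS = [
    Tier("enrn", stub("enrn", None, 0.0), threshold=0.9),
    Tier("emm", stub("emm", "partial match", 0.4), threshold=0.8),
    Tier("rag", stub("rag", "passage with citation", 0.85), threshold=0.7),
    Tier("live", stub("live", "synthesised answer", 0.6), threshold=0.0),
]

tier_name, result = run_cascade("what does my contract say about renewal?", TIERS)
print(tier_name, calls)
```

With these stub values the cascade stops at the RAG tier and the live tier is never called, which is where the saving on paid external APIs comes from.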

The retention loop

Search → accept → ingest → enrich → dream → train.

The cascade is the read path. The retention loop is what happens after — and it's what makes Eldric get smarter with use rather than getting heavier with use.

The flow:

Search. You ask a question; the cascade fires; you get an answer with citations.
Accept. Below the answer is a thumbs-up / thumbs-down footer. Clicking through a citation counts as an implicit accept.
Ingest. On accept, the platform auto-ingests the cited sources into the appropriate knowledge base. New documents from live external sources land in your RAG layer; the next query against the same topic doesn't need the live round-trip.
Enrich. The ingested documents pass through the content-aware chunking pipeline (per chunking strategies) and get auto-metadata — authors, DOIs, topic tags, cross-references, entity links.
Dream. On the next dream cycle, the platform extracts themes from accepted sessions and writes them into matrix memory. Patterns the platform sees often become fast-lookup patterns.
Train. Hot patterns become candidates for the next ENRN training corpus. Queries that used to hit tier-3 (RAG) start hitting tier-1 (learned weights) directly — the platform's response time drops over time without sacrificing accuracy.

The whole loop is opt-out per tenant; admins can run with the retention loop off if they want a static RAG layer. By default it's on, because that's the path to a knowledge layer that improves with traffic instead of stagnating.
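The loop can be pictured as a small amount of per-tenant state. The sketch below is an assumption-heavy illustration: `TenantState`, `on_accept`, `dream_cycle`, `training_candidates` and the `min_hits` threshold are hypothetical names and values, the enrich step is left out, and the real dream and train stages are certainly more involved than a counter.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class TenantState:
    retention_enabled: bool = True                        # on by default; admins can opt out
    knowledge_base: Set[str] = field(default_factory=set)
    pending_themes: List[str] = field(default_factory=list)
    theme_counts: Counter = field(default_factory=Counter)


def on_accept(state: TenantState, cited_sources: Iterable[str],
              themes: Iterable[str]) -> List[str]:
    """Ingest step: store newly accepted sources, queue themes for the next dream cycle."""
    if not state.retention_enabled:
        return []                                         # static RAG layer: nothing changes
    # dict.fromkeys de-duplicates while keeping the citation order.
    new_sources = [s for s in dict.fromkeys(cited_sources)
                   if s not in state.knowledge_base]
    state.knowledge_base.update(new_sources)
    state.pending_themes.extend(themes)
    return new_sources


def dream_cycle(state: TenantState) -> None:
    """Dream step: fold themes from accepted sessions into long-lived counts."""
    state.theme_counts.update(state.pending_themes)
    state.pending_themes.clear()


def training_candidates(state: TenantState, min_hits: int = 3) -> List[str]:
    """Train step: themes seen at least min_hits times become training candidates."""
    return sorted(t for t, n in state.theme_counts.items() if n >= min_hits)
```

A source is only stored the first time it is accepted, and a theme only becomes a training candidate after a dream cycle has seen it often enough. With `retention_enabled=False` the same calls leave the knowledge base untouched.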

What the user sees

Thumb up. Thumb down. That's the whole UI.

The retention loop runs behind the scenes. From the user's perspective, the only new surface is the small footer under each assistant response — a 👍 button, a 👎 button, and a "view sources" link that expands the citations inline. Clicking 👍 (or expanding a citation, which counts as a soft accept) starts the ingest. Clicking 👎 marks the answer as low-quality; the citations are not auto-ingested, the dream cycle weights them lower, and the platform tries different sources next time the same topic comes up.
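One way to read the footer behaviour is as a mapping from three UI events to two outcomes. The sketch below assumes hypothetical names throughout (`Feedback`, `apply_feedback`, the `penalty` factor, the in-memory queue and weight table); it only illustrates the rule that accepts queue sources for ingest while a thumbs-down lowers their weight and skips ingest.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Feedback(Enum):
    THUMBS_UP = "thumbs_up"
    CITATION_EXPANDED = "citation_expanded"   # treated as a soft accept
    THUMBS_DOWN = "thumbs_down"


def apply_feedback(feedback: Feedback,
                   cited_sources: Iterable[str],
                   source_weights: Dict[str, float],
                   ingest_queue: List[str],
                   penalty: float = 0.5) -> None:
    """Route one footer event to either the ingest queue or a weight penalty."""
    if feedback is Feedback.THUMBS_DOWN:
        # Rejected answers are never ingested; their sources are down-weighted
        # so that other sources are preferred next time (assumed mechanism).
        for source in cited_sources:
            source_weights[source] = source_weights.get(source, 1.0) * penalty
        return
    # Thumbs-up and citation expansion both count as an accept here.
    for source in cited_sources:
        if source not in ingest_queue:
            ingest_queue.append(source)
```

Treating the soft accept the same as a thumbs-up is a simplification; a real system might weight the two signals differently.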

The user never sees the cascade choosing tiers; they don't have to know whether their answer came from learned weights or live OpenAlex. The answer comes back with citations; the loop runs in the background.

What this changes operationally

Less storage, faster over time, lower API bills.

Less storage. A traditional pre-warehouse RAG ingests every source the operator can think of and hopes the relevant ones are in there. Eldric only ingests the sources that someone actually accepts — so the knowledge base stays the size of what the platform actually uses.

Faster over time. Queries that used to require tier-3 (vector search) move to tier-2 (compressed memory) and then to tier-1 (learned weights) as the platform internalises patterns. Latency goes down without any tuning.

Fewer paid external API calls. If a question's answer is in your own documents or already in the platform's learned weights, the tier-4 escalation never fires. Bills for paid embedding or paid retrieval APIs drop in proportion to the share of queries answered before tier-4.

The trade-off is honest: a brand-new installation answers from tier-3 / tier-4 a lot, because nothing is in the platform's learned weights yet. After a few weeks of accepted queries, tier-1 / tier-2 carry an increasing share. The platform pays back the cold-start tax over time, not all at once.
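To watch the cold-start tax being paid back, it is enough to track which tier answered each query. A small sketch, assuming a hypothetical log that holds one tier name per answered query; the `tier_share` helper and the two sample weeks are made-up illustrations, not measurements.

```python
from collections import Counter
from typing import Dict, Iterable


def tier_share(answered_by: Iterable[str]) -> Dict[str, float]:
    """Fraction of queries answered by each tier in a log of tier names."""
    counts = Counter(answered_by)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}


# Illustrative logs only: a fresh install versus the same install weeks later.
week_1 = ["rag", "live", "rag", "live"]
week_6 = ["enrn", "emm", "rag", "enrn"]

print(tier_share(week_1))
print(tier_share(week_6))
```

A rising share for the first two tiers and a falling share for live sources is the signal that the retention loop is doing its job.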

Going further

Next.

The customer-facing how-to for the RAG side: using RAG. The technical view of how the cascade is wired across the workers: how it works. The chunking layer that determines RAG hit quality: chunking strategies. For customers who want to teach Eldric their own intent classes (so tier-1 covers domain-specific queries faster): custom classification.