A separation-of-concerns architecture, all inside one binary. Routing, native embedding and vector storage are three distinct responsibilities that Eldric carries as separate roles: the same self-contained platform playing different parts, spread across your cluster nodes by role rather than assembled from separate services. This page is the technical view for power users and operators who want to know why the RAG path is shaped this way and how the pieces sit relative to each other.
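When a chat request triggers retrieval, three roles participate in order, all provided by the same Eldric binary whether they run on one machine or across a cluster: routing coordinates the request, native inference embeds the query, and the data role searches the vector store for matching passages.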
The chat completion then runs with those retrieved passages injected as context, and the platform returns the grounded answer along with citation metadata pointing back to the source.
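The coordinate, embed, store sequence can be sketched in a few lines. This is a minimal illustration, not Eldric's implementation: `toy_embed`, `search`, `answer_with_retrieval` and the `Passage` type are all hypothetical names, the embedding is a toy word count over a three-word vocabulary, and the completion function is a stub supplied by the caller.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical toy vocabulary; a real embedding model produces dense vectors.
VOCAB = ["backup", "tenant", "gpu"]


@dataclass
class Passage:
    source: str    # used as citation metadata
    text: str
    vector: Vector


def toy_embed(text: str) -> List[float]:
    """Stand-in for the native inference role: count vocabulary words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(store: List[Passage], query_vec: Vector, k: int) -> List[Passage]:
    """Stand-in for the data role: rank stored passages by cosine similarity."""
    ranked = sorted(store, key=lambda p: cosine(p.vector, query_vec), reverse=True)
    return ranked[:k]


def answer_with_retrieval(
    question: str,
    embed: Callable[[str], Vector],
    store: List[Passage],
    complete: Callable[[str], str],
    k: int = 1,
) -> Tuple[str, List[str]]:
    """Stand-in for the routing role: coordinate embed, search, completion."""
    query_vec = embed(question)                # step 1: embed the query
    passages = search(store, query_vec, k)     # step 2: vector search
    context = "\n".join(p.text for p in passages)
    answer = complete(f"Context:\n{context}\n\nQuestion: {question}")
    return answer, [p.source for p in passages]  # answer plus citations


store = [
    Passage("ops.md", "Backup runs nightly", toy_embed("Backup runs nightly")),
    Passage("hw.md", "GPU nodes serve inference", toy_embed("GPU nodes serve inference")),
]
answer, citations = answer_with_retrieval(
    "When does backup run", toy_embed, store, lambda prompt: "stub answer"
)
```

The coordinator never touches the embedding model directly; it only calls whatever `embed` and `store` it is handed, which is the property the next paragraphs rely on.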
If the routing role embedded queries itself, every node acting as a coordinator would need the embedding model loaded — eating memory and slowing startup. Worse, in a cluster the coordinating node may not be the one with a GPU, so it would embed on a CPU while a perfectly good GPU sat one host over. Keeping routing a pure coordinator means the embedding work scales independently of the routing work.
The embedding model is deliberately compact — it runs comfortably on CPU, fits on modest hardware down to a small single-board machine, and turns a page of text into a vector in a fraction of a second. Because Eldric already serves models natively, hosting the embedding model as part of native inference is the natural home for it: no extra runtime, no separate service, just another model the platform already knows how to serve.
The data role already holds the source documents, the chunks, the tenant boundaries, the audit trail and the file storage. Putting the vector entries alongside the chunks, rather than in a separate vector database, means deletion, re-embedding, tenant isolation and backup each happen in one place instead of being repeated across two systems.
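To illustrate why co-locating vectors with chunks collapses deletion into one operation, here is a sketch using SQLite from the Python standard library. The schema, table names and helper functions are assumptions made for the example and say nothing about how Eldric actually stores data; the point is only that when the vector lives on the chunk row, one tenant-scoped delete removes document, chunks and vectors together.

```python
import json
import sqlite3
from typing import List, Sequence, Tuple


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(
        """
        CREATE TABLE documents (
            id     INTEGER PRIMARY KEY,
            tenant TEXT NOT NULL,
            name   TEXT NOT NULL
        );
        CREATE TABLE chunks (
            id          INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL
                        REFERENCES documents(id) ON DELETE CASCADE,
            text        TEXT NOT NULL,
            vector      TEXT NOT NULL  -- embedding stored next to its chunk
        );
        """
    )
    return conn


def add_document(
    conn: sqlite3.Connection,
    tenant: str,
    name: str,
    chunks: List[Tuple[str, Sequence[float]]],
) -> int:
    """Insert a document and its embedded chunks in one transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO documents (tenant, name) VALUES (?, ?)", (tenant, name)
        )
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO chunks (document_id, text, vector) VALUES (?, ?, ?)",
            [(doc_id, text, json.dumps(list(vec))) for text, vec in chunks],
        )
    return doc_id


def delete_document(conn: sqlite3.Connection, tenant: str, doc_id: int) -> int:
    """Tenant-scoped delete; chunks and vectors go with the document row."""
    with conn:
        cur = conn.execute(
            "DELETE FROM documents WHERE id = ? AND tenant = ?", (doc_id, tenant)
        )
    return cur.rowcount


def chunk_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

With a separate vector database, the same deletion would need a second call to the other system and some way to reconcile the two if one of them failed.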
These three responsibilities are not three products stitched together. They are roles within the single Eldric platform. Install the platform once and it can play all of them; in a larger deployment you decide which nodes take which roles, and the platform coordinates across them over the cluster network.
That is the pattern throughout Eldric: capabilities are modules of one self-contained system, not a stack you assemble. Retrieval is simply one place where three of those roles line up in sequence — coordinate, embed, store — to turn a question into a grounded, cited answer.
The compressed-memory side of retrieval rides on the same data role, in Eldric's compact portable .emm format, queried alongside the exact vector store so fast associative recall and precise lookup arrive together.
Everything on one host. Routing, native inference and the data role are all served by the one platform on the one machine. Embedding shares the host with the rest of the inference work; the compact embedding model takes only a fraction of the available memory, so it runs alongside larger models without contention.
Spread the roles across the cluster. A small management node coordinates; a GPU-equipped node (or any node with enough CPU headroom, if you would rather not dedicate GPU to embedding) serves native inference; a storage-heavy node holds the data role. The platform discovers where each role lives and coordinates across them automatically — you assign roles in the admin console, not through startup flags.
On a small edge box, all three roles run on the single host — the platform ships as one unit. The edge node embeds locally, indexes locally and queries locally. If a central cluster is configured, whole knowledge bases move between edge and centre without re-embedding.
If native inference is unreachable, the platform returns a structured error rather than a broken answer. New uploads that need embedding stay marked "queued" and retry on the next cycle once the role is back. Chat queries that would have triggered retrieval fall back to plain chat — no citations — instead of failing the whole request.
If the data role is unreachable, vector search reports the role as unavailable and chat requests fall back to plain chat. Uploads queue on the coordinating node and flush once the data role is back.
If the embedding model isn't loaded, the platform says so plainly with a clear hint. The admin console has a load control; the model is present on a stock install, and embedding picks up again on the next request once it's loaded.
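The degradation behaviour described above, falling back to plain chat and keeping uploads queued, can be sketched as follows. The names `RoleUnavailable`, `ChatResult`, `chat` and `flush_queue` are hypothetical and chosen for the example; the actual error structure and retry schedule are not specified on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class RoleUnavailable(Exception):
    """Hypothetical error raised when a role cannot be reached."""


@dataclass
class ChatResult:
    answer: str
    citations: List[str] = field(default_factory=list)
    grounded: bool = False  # True only when retrieval contributed context


def chat(
    question: str,
    retrieve: Callable[[str], List[Tuple[str, str]]],  # returns (source, text)
    complete: Callable[[str], str],
) -> ChatResult:
    """Answer with retrieval when possible, plain chat otherwise."""
    try:
        passages = retrieve(question)
    except RoleUnavailable:
        # Retrieval is down: answer without context and without citations
        # instead of failing the whole request.
        return ChatResult(answer=complete(question))
    context = "\n".join(text for _, text in passages)
    answer = complete(f"Context:\n{context}\n\nQuestion: {question}")
    return ChatResult(answer, [source for source, _ in passages], True)


def flush_queue(
    queue: List[str], embed_and_store: Callable[[str], None]
) -> List[str]:
    """Retry queued uploads; items that still fail remain queued."""
    remaining = []
    for item in queue:
        try:
            embed_and_store(item)
        except RoleUnavailable:
            remaining.append(item)
    return remaining
```

Calling `flush_queue` once per retry cycle and replacing the queue with its return value gives the "stay queued, retry once the role is back" behaviour without losing uploads.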
For the customer-facing how-to: using RAG. For the compressed-memory preview that speeds up vector search at concurrency: advanced retrieval. For the inference-side preview that consults memory at the prompt boundary instead of round-tripping: smart memory inference.
For the rest of the system the RAG path sits in: how it works walks the whole architecture, from client to edge to the coordinating, routing and data roles and the native inference behind them.