A separation-of-concerns architecture. The controller is routing only; the embedding model lives on the native inference daemon; the vector store lives on the data worker. Three processes, three responsibilities, one wire. This page is the technical view for power users and operators who want to know why it's shaped that way and how to bend it when they need to.
On a chat request that triggers RAG, three processes participate in order:
The controller exposes /api/v1/embeddings (proxy to Inferenced) and /api/v1/vector/search (aggregator over data workers). The controller does no in-process inference; it's all routing.
The chat completion request then runs through the inference path with the retrieved passages injected as context, returning the grounded answer and citation metadata to the client.
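As a rough illustration of the flow above, the sketch below chains the two controller routes and then builds a grounded prompt. Only the two route paths come from this page. The request and response field names (input, vector, top_k, hits, text, source), the injected post callable and the prompt layout are assumptions made for illustration, not the platform's documented schema.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical transport: takes a route path and a JSON-style payload and
# returns the decoded JSON response. Injected so the flow runs without a network.
Post = Callable[[str, dict], dict]

EMBED_ROUTE = "/api/v1/embeddings"      # controller proxies this to Inferenced
SEARCH_ROUTE = "/api/v1/vector/search"  # controller aggregates this over data workers


def retrieve_passages(post: Post, question: str, top_k: int = 3) -> list[dict]:
    """Embed the question, then search the vector store with the result."""
    # Step 1: embedding (assumed OpenAI-style response shape).
    embedded = post(EMBED_ROUTE, {"input": question})
    vector = embedded["data"][0]["embedding"]
    # Step 2: vector search (payload and response fields are assumptions).
    found = post(SEARCH_ROUTE, {"vector": vector, "top_k": top_k})
    return found["hits"]


def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Inject retrieved passages as numbered context ahead of the question."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Passing the transport in as a parameter mirrors the page's point that the caller only routes: neither function knows where the embedding or the search actually runs.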
If the controller did the embedding in-process, every node running a controller would need the embedding model loaded — eating RAM and burning startup time. Worse, in cluster setups the controller might not be the GPU-equipped node, so it'd be embedding on a CPU when a perfectly good GPU sits one host over. Keeping the controller a pure router means scaling the embedding work is decoupled from scaling the routing work.
nomic-embed-text-Q4_K_M is 80 MB on disk, runs comfortably on CPU, and produces 768-dimension vectors compatible with the data worker's storage layout. Quantising to Q4_K_M trades a small amount of embedding quality for a model that fits in memory on a Raspberry Pi 4 and embeds 1k tokens in well under a second on any modern CPU. The model is in GGUF format, which llama.cpp can serve directly; Inferenced already speaks GGUF, so it is the natural place to host the model.
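To make the 768-dimension figure concrete, here is a back-of-envelope sketch of raw vector storage. It assumes each component is stored as a 4-byte float32 and ignores index and metadata overhead. This page does not state the data worker's actual on-disk layout, so treat the result as an order-of-magnitude estimate only.

```python
from __future__ import annotations

EMBED_DIM = 768        # from this page: the model produces 768-dimension vectors
BYTES_PER_FLOAT32 = 4  # assumption: components are stored as float32


def raw_vector_bytes(num_chunks: int, dim: int = EMBED_DIM) -> int:
    """Raw bytes needed for num_chunks embeddings, excluding index overhead."""
    return num_chunks * dim * BYTES_PER_FLOAT32


def check_dimension(vector: list[float], dim: int = EMBED_DIM) -> None:
    """Reject vectors whose length does not match the expected dimension."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
```

Under these assumptions one vector takes 3,072 bytes, and 100,000 chunks come to 307,200,000 bytes (about 293 MiB). A dimension check like this is also a cheap guard when swapping in a different embedding model.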
The data worker already holds the source documents, the chunks, the tenant boundaries, the audit ledger, and the file storage. Putting the vector entries alongside the chunks (rather than in a separate vector database) means deletion, re-embedding, tenant isolation and backup are all one operation, not five.
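The "one operation, not five" point can be illustrated with a small sketch. SQLite is used here purely as a stand-in, since this page does not say which storage engine the data worker uses, and the table and column names are hypothetical. The idea shown is that when each vector sits in the same row as its chunk, deleting a document within a tenant boundary removes chunks and vectors together.

```python
import sqlite3

# Hypothetical schema: the embedding is a column on the chunk row,
# not an entry in a separate vector database.
SCHEMA = """
CREATE TABLE documents (
    id        INTEGER PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    name      TEXT NOT NULL
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    text        TEXT NOT NULL,
    embedding   BLOB
);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with cascading deletes enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only with this on
    conn.executescript(SCHEMA)
    return conn


def delete_document(conn: sqlite3.Connection, tenant_id: str, document_id: int) -> None:
    """Delete a document scoped to its tenant; chunks and vectors cascade with it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM documents WHERE id = ? AND tenant_id = ?",
            (document_id, tenant_id),
        )
```

Because the tenant check and the cascade live in one statement, there is no window in which a chunk exists without its vector or the reverse.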
The default puts Inferenced and the controller on the same host (or wherever the controller can reach Inferenced over the cluster network). Some deployments want something different — Inferenced on a dedicated GPU host, an external embedding service, or a different GGUF model entirely. That's handled by one environment variable on the controller:
# /etc/eldric/eldric-aios.env
ELDRIC_EMBED_BACKEND_URL=http://inferenced-host:8883
Any URL that speaks the OpenAI-compatible /v1/embeddings endpoint works — Inferenced is the default, but Ollama also satisfies it, as does a self-hosted llama.cpp server, vLLM, TGI, or any cloud embedding API if you want to send embeddings off-prem (most customers don't).
Restart the controller after editing the env file (sudo systemctl restart eldric-aios-controller); the new backend takes effect on the next chat request.
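Before pointing the controller at a different backend, it can help to confirm that the backend really answers on the OpenAI-compatible /v1/embeddings endpoint. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the model identifier a given backend expects is an assumption you will need to adjust.

```python
from __future__ import annotations

import json
import urllib.request


def build_embeddings_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST to <base_url>/v1/embeddings in the OpenAI-compatible shape."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body: bytes) -> list[list[float]]:
    """Extract vectors from an OpenAI-style response, ordered by their index field."""
    payload = json.loads(response_body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def probe(base_url: str, model: str) -> int:
    """Send one test string and return the vector dimension the backend reports."""
    request = build_embeddings_request(base_url, model, ["ping"])
    with urllib.request.urlopen(request, timeout=10) as response:
        return len(parse_embeddings(response.read())[0])
```

For example, probe("http://inferenced-host:8883", "nomic-embed-text-Q4_K_M") would report the dimension of the vectors that backend returns. If a replacement backend reports something other than 768, its vectors will likely not match what the data worker expects.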
Everything on one host. The controller, Inferenced and the data worker are three systemd units on the same machine. Inferenced co-tenants with the LLM inference work; embedding takes a fraction of GPU memory and runs alongside larger models without contention.
The controller lives on a small management host. Inferenced lives on a GPU host (or any node with enough CPU if you don't want to dedicate GPU to embedding). The data worker lives on the storage-heavy host. The controller reaches Inferenced via ELDRIC_EMBED_BACKEND_URL; the controller reaches data workers via the topology discovery layer (no env var needed — heartbeat-pushed).
On a Pi 4 or NUC running the minimal edge runtime, all three live on the single edge host (the platform ships them as a unit). The edge node embeds locally, indexes locally, queries locally. If a central cluster is configured, the bundle export / import path moves whole knowledge bases between edge and centre without re-embedding.
If Inferenced is unreachable, the controller returns a structured error on the embeddings route (502 with explanatory body). New uploads fail to embed and stay marked "queued"; the platform retries on the next upload cycle once Inferenced is back. Chat queries that would have triggered RAG fall back to plain chat (no citations) rather than failing the whole request.
If the data worker is unreachable, vector-search returns 503; chat requests fall back to plain chat. Upload via the GUI's chunked-upload path queues on the controller and flushes once the data worker is back.
If the embedding model isn't loaded in Inferenced, the embeddings route returns 404 with a model-not-loaded hint. The Inferenced admin dashboard has a Load button; click it (the file lives at /data/eldric/models/nomic-embed-text-Q4_K_M.gguf on a stock install) and embedding picks up again on the next request.
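The failure modes above can be summarised in a short sketch. The status codes and their meanings are taken from this page; the route labels, class and function names are hypothetical. One further assumption: the page states the plain-chat fallback for the 502 and 503 cases only, and the sketch treats the 404 case the same way.

```python
from __future__ import annotations

from typing import Callable

# Status codes as described above, keyed by (route label, HTTP status).
FAILURES = {
    ("embeddings", 502): "Inferenced unreachable",
    ("embeddings", 404): "embedding model not loaded in Inferenced",
    ("vector-search", 503): "data worker unreachable",
}


def describe_failure(route: str, status: int) -> str:
    """Map a failed retrieval call to a human-readable cause."""
    return FAILURES.get((route, status), f"unexpected {status} on {route}")


class RetrievalError(Exception):
    """Raised by a retrieve callable when either RAG route fails."""

    def __init__(self, route: str, status: int):
        self.route, self.status = route, status
        super().__init__(describe_failure(route, status))


def answer(
    question: str,
    retrieve: Callable[[str], list],
    chat: Callable[[str, list], str],
) -> dict:
    """Try RAG first; on retrieval failure fall back to plain chat, no citations."""
    try:
        passages = retrieve(question)
    except RetrievalError as err:
        # Assumption: every retrieval failure degrades the same way.
        return {"text": chat(question, []), "citations": [], "degraded": str(err)}
    return {"text": chat(question, passages), "citations": passages, "degraded": None}
```

The degraded field is an illustrative addition so a caller can tell a grounded answer from a fallback one; the page does not say whether the platform surfaces this to clients.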
For the customer-facing how-to: using RAG. For the compressed-memory preview that speeds up vector search at concurrency: advanced retrieval. For the inference-side preview that consults memory at the prompt boundary instead of round-tripping: smart memory inference.
For the rest of the system the RAG path sits in: how it works walks the whole 4-level architecture (Client → Edge → Controller / Router / Data → 10 workers including Inferenced).