RAG architecture

How RAG actually wires up.

RAG here follows a separation-of-concerns architecture. The controller only routes; the embedding model lives on the native inference daemon; the vector store lives on the data worker. Three processes, three responsibilities, one wire. This page is the technical view for power users and operators who want to know why the architecture is shaped this way and how to bend it when they need to.


The chain

Controller → Inferenced → Data Worker.

On a chat request that triggers RAG, three processes participate in order:

  1. Controller (port 8880): receives the request and routes it. The RAG path uses two routes: /api/v1/embeddings (a proxy to Inferenced) and /api/v1/vector/search (an aggregator over data workers). The controller does no in-process inference.
  2. Inferenced (port 8883): the native GGUF runtime. It holds a quantised embedding model (nomic-embed-text-Q4_K_M.gguf, 768-dim, ~80 MB) in memory. The controller's embeddings route proxies here.
  3. Data Worker (port 8892): stores the indexed documents and the vector entries. The controller's vector-search aggregator embeds the query (via Inferenced, step 2), then asks the data worker for the k-nearest-neighbour hits.

The chat completion request then runs through the inference path with the retrieved passages injected as context, and the client receives the grounded answer plus citation metadata.
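
The retrieval hops above can be sketched as a single function. This is a minimal illustration rather than the controller's actual code: `embed` and `knn_search` are hypothetical callables standing in for the Inferenced hop and the data-worker hop, and the hit fields (`text`, `source`) are assumed for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Dict[str, object]  # assumed shape: {"text": str, "source": str}


def retrieve_context(
    query: str,
    embed: Callable[[str], Vector],                  # stands in for the Inferenced hop
    knn_search: Callable[[Vector, int], List[Hit]],  # stands in for the data-worker hop
    k: int = 4,
) -> Tuple[str, List[str]]:
    """Embed the query, fetch the k nearest chunks, and build a context block."""
    query_vector = embed(query)
    hits = knn_search(query_vector, k)
    # Number the passages so the grounded answer can cite them.
    context = "\n".join(f"[{i}] {hit['text']}" for i, hit in enumerate(hits, start=1))
    citations = [str(hit["source"]) for hit in hits]
    return context, citations
```

The returned context string is what would be injected into the chat prompt, and the citations list is what would travel back to the client as metadata.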


Why this shape

Three reasons.

1. The controller stays routing-only.

If the controller did the embedding in-process, every node running a controller would need the embedding model loaded — eating RAM and burning startup time. Worse, in cluster setups the controller might not be the GPU-equipped node, so it'd be embedding on a CPU when a perfectly good GPU sits one host over. Keeping the controller a pure router means scaling the embedding work is decoupled from scaling the routing work.

2. The embedding model is GGUF + small.

nomic-embed-text-Q4_K_M is about 80 MB on disk, runs comfortably on CPU, and produces 768-dimension vectors compatible with the data worker's storage layout. Quantising to Q4_K_M trades a small amount of embedding quality for a model that fits in memory on a Raspberry Pi 4 and embeds 1k tokens in well under a second on any modern CPU. GGUF means llama.cpp can serve it directly; Inferenced already speaks GGUF, so this is the natural place to host it.
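
To put the 768-dimension figure in storage terms, here is a back-of-envelope calculation. It assumes vectors are kept as uncompressed 32-bit floats, which this page does not state, and the function name is illustrative.

```python
DIMENSIONS = 768       # from the model description above
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32


def vector_store_bytes(num_chunks: int, dims: int = DIMENSIONS) -> int:
    """Raw bytes needed for num_chunks embeddings, excluding any index overhead."""
    return num_chunks * dims * BYTES_PER_FLOAT32


per_vector = vector_store_bytes(1)           # 3072 bytes, i.e. 3 KiB
per_million = vector_store_bytes(1_000_000)  # 3,072,000,000 bytes
print(per_vector, round(per_million / 2**30, 2))  # 3072 2.86
```

Under that assumption a million chunks need roughly 2.9 GiB of raw vector storage before any index structures are counted.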

3. The vector store is where the data lives.

The data worker already holds the source documents, the chunks, the tenant boundaries, the audit ledger, and the file storage. Putting the vector entries alongside the chunks (rather than in a separate vector database) means deletion, re-embedding, tenant isolation and backup are all one operation, not five.
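
One way to picture "one operation, not five" is a schema in which each chunk row carries its own vector. The sketch below uses SQLite purely for illustration; the data worker's real storage engine, table names and columns are not described on this page and are assumptions.

```python
import sqlite3
from array import array


def open_store() -> sqlite3.Connection:
    """Create an in-memory store where vectors live in the chunk rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves cascades off by default
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, tenant TEXT NOT NULL, name TEXT);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
            text TEXT NOT NULL,
            embedding BLOB NOT NULL  -- vector stored next to the chunk it describes
        );
        """
    )
    return conn


def add_chunk(conn: sqlite3.Connection, document_id: int, text: str, vector) -> None:
    """Store a chunk and its embedding in the same row."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT INTO chunks (document_id, text, embedding) VALUES (?, ?, ?)",
        (document_id, text, blob),
    )


def delete_document(conn: sqlite3.Connection, document_id: int) -> None:
    """One delete removes the document, its chunks and their vectors."""
    conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))


conn = open_store()
conn.execute("INSERT INTO documents (id, tenant, name) VALUES (1, 'acme', 'handbook.pdf')")
add_chunk(conn, 1, "hello", [0.0] * 768)
delete_document(conn, 1)
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]  # 0
```

Because the vector has no separate home, there is nothing left to reconcile after the delete; the same reasoning applies to backup and tenant-scoped queries.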


Custom topologies

Pointing the controller at a different embedding backend.

The default puts Inferenced and the controller on the same host (or wherever the controller can reach Inferenced over the cluster network). Some deployments want something different — Inferenced on a dedicated GPU host, an external embedding service, or a different GGUF model entirely. That's handled by one environment variable on the controller:

# /etc/eldric/eldric-aios.env

ELDRIC_EMBED_BACKEND_URL=http://inferenced-host:8883

Any URL that speaks the OpenAI-compatible /v1/embeddings endpoint works — Inferenced is the default, but Ollama also satisfies it, as does a self-hosted llama.cpp server, vLLM, TGI, or any cloud embedding API if you want to send embeddings off-prem (most customers don't).

Restart the controller after editing the env file (sudo systemctl restart eldric-aios-controller); the new backend is used from the next chat request onward.
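
Before switching backends, it can help to check that the candidate URL really answers an OpenAI-style embeddings call with 768-dimension vectors. The probe below is a sketch under assumptions: the fallback URL and the model name passed in the request are illustrative (each backend has its own model naming), and cloud APIs will additionally need an authorization header that is not shown.

```python
import json
import os
import urllib.request

# Assumption: the fallback URL is illustrative; only the variable name and port come from this page.
BACKEND_URL = os.environ.get("ELDRIC_EMBED_BACKEND_URL", "http://localhost:8883")
EXPECTED_DIMS = 768


def build_request(base_url: str, text: str, model: str) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request for the given backend."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vector(payload: dict) -> list:
    """Pull the first embedding out of an OpenAI-style response body."""
    return payload["data"][0]["embedding"]


def probe(base_url: str = BACKEND_URL, model: str = "nomic-embed-text") -> bool:
    """Return True if the backend answers with a vector of the expected size."""
    request = build_request(base_url, "hello", model)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return len(extract_vector(payload)) == EXPECTED_DIMS
```

Calling `probe("http://inferenced-host:8883")` from the controller host exercises the same network path the controller will use. A backend that returns a different vector size would not match the 768-dimension entries already indexed.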


Single-node vs cluster

Where each process lives.

Single-node

Everything on one host. The controller, Inferenced and the data worker are three systemd units on the same machine. Inferenced shares the machine with the LLM inference work; embedding takes a fraction of GPU memory and runs alongside larger models without contention.

Multi-node

The controller lives on a small management host. Inferenced lives on a GPU host (or any node with enough CPU if you don't want to dedicate GPU to embedding). The data worker lives on the storage-heavy host. The controller reaches Inferenced via ELDRIC_EMBED_BACKEND_URL; the controller reaches data workers via the topology discovery layer (no env var needed — heartbeat-pushed).

Edge

On a Pi 4 or NUC running the minimal edge runtime, all three live on the single edge host (the platform ships them as a unit). The edge node embeds locally, indexes locally, queries locally. If a central cluster is configured, the bundle export / import path moves whole knowledge bases between edge and centre without re-embedding.


Failure modes

What happens when something is down.

If Inferenced is unreachable, the controller returns a structured error on the embeddings route (502 with explanatory body). New uploads fail to embed and stay marked "queued"; the platform retries on the next upload cycle once Inferenced is back. Chat queries that would have triggered RAG fall back to plain chat (no citations) rather than failing the whole request.

If the data worker is unreachable, vector-search returns 503; chat requests fall back to plain chat. Upload via the GUI's chunked-upload path queues on the controller and flushes once the data worker is back.

If the embedding model isn't loaded in Inferenced, the embeddings route returns 404 with a model-not-loaded hint. The Inferenced admin dashboard has a Load button; click it (the file lives at /data/eldric/models/nomic-embed-text-Q4_K_M.gguf on a stock install) and embedding resumes on the next request.
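
The three cases above can be collected into a small lookup for scripts that monitor the controller. This is a sketch for operators, not part of the platform: the `diagnose` helper and its wording are hypothetical, and only the routes, status codes and behaviours come from this page.

```python
EMBEDDINGS_ROUTE = "/api/v1/embeddings"
VECTOR_SEARCH_ROUTE = "/api/v1/vector/search"

# (route, HTTP status) -> what it means according to the failure modes above
HINTS = {
    (EMBEDDINGS_ROUTE, 502): "Inferenced is unreachable; uploads stay queued and chat falls back to plain chat.",
    (EMBEDDINGS_ROUTE, 404): "Embedding model is not loaded; load it from the Inferenced admin dashboard.",
    (VECTOR_SEARCH_ROUTE, 503): "Data worker is unreachable; chat falls back to plain chat.",
}


def diagnose(route: str, status: int) -> str:
    """Translate a controller response status into an operator hint."""
    if 200 <= status < 300:
        return "ok"
    # Anything else is outside the cases this page documents.
    return HINTS.get((route, status), f"Status {status} on {route} is not covered by this page.")


print(diagnose(EMBEDDINGS_ROUTE, 404))
```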


Going further

Next.

For the customer-facing how-to: using RAG. For the compressed-memory preview that speeds up vector search at concurrency: advanced retrieval. For the inference-side preview that consults memory at the prompt boundary instead of round-tripping: smart memory inference.

For the rest of the system the RAG path sits in: how it works walks the whole 4-level architecture (Client → Edge → Controller / Router / Data → 10 workers including Inferenced).