Eldric is a small set of co-operating processes. Each one does one job. They can all run on one machine for a developer trial, or spread across many machines for a multi-tenant production cluster.
A request from a client lands at the Edge (TLS + auth + rate-limit). The Edge hands it to a Router. The Router asks the Controller for the current topology, classifies the request (chat? RAG? voice? science look-up?), and picks the right worker. The worker does the work and streams the answer back along the same path. The data path is intentionally short.
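That hop sequence, reduced to a runnable sketch. Everything below (the function names, the topology shape, the intent labels) is illustrative stand-in code, not Eldric's actual API.

```python
# Minimal sketch of the Edge -> Router -> Worker path; all names are hypothetical.
from typing import Iterator

def classify_intent(prompt: str) -> str:
    # Toy classifier; the real Router uses rules plus an optional LLM-based mode.
    return "rag" if "search" in prompt.lower() else "chat"

def pick_worker(topology: dict[str, list[str]], intent: str) -> str:
    # Take any worker registered for this intent; the real Router also weighs load and theme.
    return topology[intent][0]

def handle_request(prompt: str, topology: dict[str, list[str]]) -> Iterator[str]:
    # By this point the Edge has already terminated TLS, validated the API key,
    # and applied rate limits.
    intent = classify_intent(prompt)
    worker = pick_worker(topology, intent)
    # The real worker streams tokens back along the same path; one fake chunk here.
    yield f"[{worker}] answer to: {prompt!r}"

if __name__ == "__main__":
    topo = {"chat": ["inference-1"], "rag": ["data-1"]}
    for chunk in handle_request("hello", topo):
        print(chunk)
```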
The only process exposed to the public network. Terminates TLS, validates the API key, enforces rate limits, and forwards to a Router. Has no model state of its own. Also serves the built-in chat shell at /chat.
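A sketch of the admission step, assuming a plain token bucket per API key. The class and function names are invented for the example; the real Edge may use a different limiter.

```python
# Hypothetical per-key rate limiting at the Edge: validate the key, then apply a token bucket.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def admit(api_key: str, valid_keys: set[str]) -> bool:
    # Reject unknown keys before they cost anything, then enforce the per-key bucket.
    if api_key not in valid_keys:
        return False
    bucket = buckets.setdefault(api_key, TokenBucket(rate=5.0, burst=10))
    return bucket.allow()
```

Requests that pass `admit()` are forwarded to a Router; everything else is rejected before it touches a worker.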
Keeps the cluster topology in one place. Workers register here and heartbeat every thirty seconds. The controller owns the license file, the audit ledger, the backup orchestration, the rolling-upgrade coordinator, and the PKI for internal certificates.
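The register-then-heartbeat contract in miniature, assuming an in-process Controller object. Over the network this would be two RPC calls; every name here is invented for the example.

```python
# Hypothetical sketch of worker registration and the thirty-second heartbeat.
import threading
import time

HEARTBEAT_INTERVAL = 30  # seconds, as described above

class Controller:
    def __init__(self):
        self.workers: dict[str, float] = {}  # worker id -> last heartbeat time

    def register(self, worker_id: str) -> None:
        self.workers[worker_id] = time.monotonic()

    def heartbeat(self, worker_id: str) -> None:
        self.workers[worker_id] = time.monotonic()

    def alive(self, worker_id: str, grace: float = 2 * HEARTBEAT_INTERVAL) -> bool:
        # A worker that misses two beats is treated as gone from the topology.
        last = self.workers.get(worker_id)
        return last is not None and time.monotonic() - last < grace

def run_worker(controller: Controller, worker_id: str, stop: threading.Event) -> None:
    controller.register(worker_id)
    while not stop.wait(HEARTBEAT_INTERVAL):  # beat every 30 s until asked to stop
        controller.heartbeat(worker_id)
```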
Decides which worker handles which request. Picks based on intent (a chat? a RAG search? a voice call?), load (which worker is least busy?), and theme (a medical question routes to a medically-tuned model). Has eight load-balancing strategies and an optional LLM-based decision mode.
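One plausible selection pass, assuming a least-loaded strategy (just one of the eight): filter by intent, prefer a theme match, then take the least-loaded candidate. The `Worker` fields and the scoring are invented for the example, not Eldric's exact logic.

```python
# Hypothetical intent + theme + load selection.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    intents: set[str]   # e.g. {"chat", "rag"}
    themes: set[str]    # e.g. {"medical"}
    load: float         # 0.0 = idle, 1.0 = saturated

def pick_worker(workers: list[Worker], intent: str, theme: str | None = None) -> Worker:
    candidates = [w for w in workers if intent in w.intents]
    if theme:
        themed = [w for w in candidates if theme in w.themes]
        candidates = themed or candidates        # fall back if no themed worker exists
    return min(candidates, key=lambda w: w.load)  # least-loaded wins

workers = [
    Worker("general-1", {"chat", "rag"}, set(), load=0.6),
    Worker("med-1", {"chat"}, {"medical"}, load=0.3),
]
print(pick_worker(workers, "chat", theme="medical").name)  # -> med-1
```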
One per function. Inference workers run models (Ollama, vLLM, llama.cpp, or a cloud API). Data workers store files, vectors, and matrix memory. Agent workers run the iterative reasoning loops. Media workers do speech-to-text, text-to-speech, and video. Comm workers carry email, SMS, WhatsApp, Signal, Teams, and VoIP. Science workers proxy to the 140 external scientific APIs. Training workers fine-tune models.
A native inference worker that loads GGUF and xLSTM models directly through embedded llama.cpp. No Ollama dependency. Use it for the smallest deployments and for air-gapped sites.
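The native worker embeds llama.cpp directly, so its internals are not shown here; as a rough analogy for what loading a GGUF model with no Ollama in between looks like, this uses the llama-cpp-python bindings (`pip install llama-cpp-python`). The model path and settings are placeholders.

```python
# Analogy only: direct GGUF loading via llama-cpp-python, not Eldric's embedded worker.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,      # context window
    n_threads=4,     # modest enough for a small, air-gapped box
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise the Eldric data path in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```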
The 5.0 kernel is the same on a Raspberry Pi 4, a developer workstation, a rack-mounted server, and across a multi-node cluster. What changes is which modules you activate per node. A small box does not get a stripped-down product; it gets the same product with fewer modules switched on.
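A hypothetical illustration of that idea: one kernel, different module subsets per node profile. The module and profile names below are invented; the real configuration keys will differ.

```python
# "Same kernel, fewer modules switched on" as a toy profile table.
KERNEL_MODULES = {
    "edge", "router", "controller", "inference", "data",
    "agent", "media", "comm", "science", "training",
}

PROFILES = {
    "raspberry-pi-4": {"edge", "router", "inference"},            # whole stack, one small box
    "workstation":    {"edge", "router", "inference", "data"},
    "cluster-node":   {"controller", "inference", "data", "agent"},
}

def active_modules(profile: str) -> set[str]:
    # Every profile activates a subset of the one kernel; nothing is stripped out.
    modules = PROFILES[profile]
    assert modules <= KERNEL_MODULES
    return modules
```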
Edge → Router → Worker. Three hops. Streaming responses pass through with no buffering. Knowledge-base search hits the EMM (compressed, associative memory) first and falls back to the vector store only when exact source citations are needed; for pure chat use cases the vector store can be dropped entirely. There is no hidden middleware that resells your data.
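The fallback rule in miniature. The EMM and vector-store interfaces below are invented stand-ins, assuming the only decision input is whether the caller needs exact citations.

```python
# Hypothetical EMM-first lookup with vector-store fallback.
class EMM:
    """Compressed, associative memory: fast, but no exact source citations."""
    def __init__(self, facts: dict[str, str]):
        self.facts = facts
    def lookup(self, query: str) -> str | None:
        return self.facts.get(query)

class VectorStore:
    """Slower path that can return exact sources; stubbed out for the sketch."""
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        return [{"text": f"passage for {query!r}", "source": "doc-42", "rank": i}
                for i in range(top_k)]

def kb_search(query: str, emm: EMM, store: VectorStore, need_citations: bool = False):
    hit = emm.lookup(query)
    if hit is not None and not need_citations:
        return hit              # pure-chat path: the vector store is never touched
    return store.search(query)  # citations needed or EMM miss: fall back
```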
On our reference cluster, chat sustains 793 requests per second at 32 concurrent connections, with median latency of 41 milliseconds. That is good. Knowledge-base search at four concurrent connections still hits a ~7-second p50 latency cliff. That is not good, and we are fixing it. The numbers come from our 2026-05 baseline; we publish them so you know what to expect.
Our reference cluster is intentionally modest. The numbers above come from this hardware.