How it works

A short tour of the architecture.

Eldric is a small set of co-operating processes. Each one does one job. They can all run on one machine for a developer trial, or spread across many machines for a multi-tenant production cluster.


The picture

One request, five steps.

Architecture · request flow (diagram)
Client: Web · CLI · GUI
Edge: TLS · auth · rate-limit · ports 443 / 80
Controller: cluster topology · license · audit
Router: intent + load + theme · picks the right worker
Worker · Inference: Ollama · vLLM · llama.cpp · cloud APIs
Worker · Data: file storage · vector RAG · matrix memory
Worker · Agent: agentic RAG · multi-agent · workflows
Worker · Media: STT · TTS · video · voice chat
Worker · Comm: email · SMS · WhatsApp · Signal · Teams · VoIP
Worker · Science: 140+ scientific API integrations
Worker · Training: LoRA · DPO · federated learning

A request from a client lands at the Edge (TLS + auth + rate-limit). The Edge hands it to a Router. The Router asks the Controller for the current topology, classifies the request (chat? RAG? voice? science look-up?), and picks the right worker. The worker does the work and streams the answer back along the same path. The data path is intentionally short.
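To make that path concrete, here is what a streaming client call might look like. The endpoint, host, and payload shape are illustrative assumptions, not the documented API.

    # Minimal client sketch: one request enters at the Edge and the answer
    # streams back along the same path. Endpoint and payload are assumptions.
    import requests

    EDGE_URL = "https://eldric.example.com/v1/chat"   # hypothetical Edge endpoint
    API_KEY = "your-api-key"                          # validated at the Edge

    with requests.post(
        EDGE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"messages": [{"role": "user", "content": "Summarise my notes"}],
              "stream": True},
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # each chunk is streamed straight back, no buffering
                print(line.decode("utf-8"))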


The processes, one paragraph each.

Edge

The only process exposed to the public network. Terminates TLS, validates the API key, enforces rate limits, and forwards to a Router. Has no model state of its own. Also serves the built-in chat shell at /chat.
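A rough sketch of those three duties in order: key check, rate limit, forward. The key store, token-bucket numbers, and forwarding target are assumptions; TLS termination and the /chat shell are left out.

    # Sketch of the Edge's request handling: validate the API key, enforce a
    # per-key token-bucket rate limit, then forward to a Router.
    import time

    API_KEYS = {"demo-key": {"tenant": "acme"}}            # hypothetical key store
    BUCKETS: dict[str, tuple[float, float]] = {}            # key -> (tokens, last refill)
    RATE, BURST = 10.0, 20.0                                # tokens/second, bucket size

    def allow(key: str) -> bool:
        """Token-bucket rate limit per API key."""
        tokens, last = BUCKETS.get(key, (BURST, time.monotonic()))
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1.0:
            BUCKETS[key] = (tokens, now)
            return False
        BUCKETS[key] = (tokens - 1.0, now)
        return True

    def handle(request_key: str, payload: dict) -> dict:
        if request_key not in API_KEYS:
            return {"status": 401, "error": "invalid API key"}
        if not allow(request_key):
            return {"status": 429, "error": "rate limit exceeded"}
        # The real process forwards to a Router over the internal network.
        return {"status": 200, "forwarded_to": "router-1", "payload": payload}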

Controller

Keeps the cluster topology in one place. Workers register here and heartbeat every thirty seconds. The controller owns the license file, the audit ledger, the backup orchestration, the rolling-upgrade coordinator, and the PKI for internal certificates.
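From the worker's side, registration plus the heartbeat might look like the following; only the thirty-second interval comes from the text, the Controller address and field names are assumptions.

    # Worker-side registration and heartbeat loop against a hypothetical
    # Controller HTTP API. Only the 30-second interval is from the docs.
    import time
    import requests

    CONTROLLER = "https://controller.internal:8443"   # hypothetical internal address
    HEARTBEAT_INTERVAL = 30                            # seconds

    def register(worker_id: str, role: str) -> None:
        requests.post(f"{CONTROLLER}/register",
                      json={"id": worker_id, "role": role}, timeout=5)

    def current_load() -> float:
        return 0.0  # placeholder; a real worker would report queue depth or utilisation

    def heartbeat_forever(worker_id: str) -> None:
        while True:
            requests.post(f"{CONTROLLER}/heartbeat",
                          json={"id": worker_id, "load": current_load()}, timeout=5)
            time.sleep(HEARTBEAT_INTERVAL)

    if __name__ == "__main__":
        register("worker-data-1", "data")
        heartbeat_forever("worker-data-1")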

Router

Decides which worker handles which request. Picks based on intent (a chat? a RAG search? a voice call?), load (which worker is busiest?), and theme (a medical question routes to a medically-tuned model). Has eight load-balancing strategies and an optional LLM-based decision mode.
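Stripped to a single strategy, the decision looks roughly like this: classify the request, keep the workers whose role matches, prefer a theme match, then take the lightest load. The worker records and the toy classifier are assumptions; the eight strategies and the LLM-based mode are not shown.

    # Routing sketch: pick the worker whose role matches the classified intent,
    # preferring theme-tuned workers and lower load. Records are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Worker:
        name: str
        role: str          # "inference", "data", "media", ...
        themes: set[str]   # e.g. {"medical"}
        load: float        # 0.0 idle .. 1.0 saturated

    def classify(request: str) -> tuple[str, str | None]:
        """Toy intent/theme classifier; the real one may be rule- or LLM-based."""
        if "search" in request:
            return "data", None
        if any(w in request for w in ("diagnosis", "dosage")):
            return "inference", "medical"
        return "inference", None

    def pick(workers: list[Worker], request: str) -> Worker:
        intent, theme = classify(request)
        candidates = [w for w in workers if w.role == intent]
        return max(candidates,
                   key=lambda w: (theme in w.themes if theme else 0, -w.load))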

Workers

One per function. Inference workers run models (Ollama, vLLM, llama.cpp, or a cloud API). Data workers store files, vectors, and matrix memory. Agent workers run the iterative reasoning loops. Media workers do speech-to-text, text-to-speech, and video. Comm workers carry email, SMS, WhatsApp, Signal, Teams, and VoIP. Science workers proxy to the 140+ external scientific APIs. Training workers fine-tune models.

Inferenced

A native inference worker that loads GGUF and xLSTM models directly through embedded llama.cpp. No Ollama dependency. Use it for the smallest deployments and for air-gapped sites.
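Inferenced itself is a native process, but the llama-cpp-python bindings give a feel for what loading a GGUF model directly, with no separate model server, looks like. The model path and parameters below are placeholders.

    # Load and query a GGUF model through embedded llama.cpp: no daemon,
    # no network hop. Model path and parameters are placeholders.
    from llama_cpp import Llama

    llm = Llama(model_path="models/your-model.gguf",  # any local GGUF file
                n_ctx=4096)                            # context window

    out = llm("Q: What does the Edge process do? A:", max_tokens=48, stop=["\n"])
    print(out["choices"][0]["text"])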

Three things worth knowing.

The same software runs on a Pi.

The 5.0 kernel is the same on a Raspberry Pi 4, a developer workstation, a rack-mounted server, and across a multi-node cluster. What changes is which modules you activate per node. A small box does not get a stripped-down product; it gets the same product with fewer modules switched on.
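As an illustration of "same kernel, different modules", a per-node activation set might be expressed like this. The module names and the config shape are hypothetical, not Eldric's actual configuration format.

    # Hypothetical per-node module activation: every node runs the same 5.0
    # kernel; only the enabled-module list differs. Names are illustrative.
    NODE_PROFILES = {
        "raspberry-pi-4": ["edge", "router", "inference-light"],
        "workstation":    ["edge", "router", "inference", "data", "agent"],
        "gpu-node-3":     ["inference", "training"],   # GPU box: models + fine-tuning
    }

    def modules_for(node: str) -> list[str]:
        return NODE_PROFILES.get(node, [])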

The data path is short.

Edge → Router → Worker. Three hops. Streaming responses pass through with no buffering. Knowledge-base search hits the EMM (compressed, associative memory) first and only falls back to the vector store when exact source citations are needed — for pure chat use cases the vector store can be dropped entirely. There is no hidden middleware that resells your data.
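The fallback logic, sketched with assumed interfaces for both stores: ask the EMM first, and only touch the vector store when exact citations are required.

    # Knowledge-base lookup sketch: hit the compressed associative memory (EMM)
    # first; query the vector store only when the caller needs exact sources.
    # Both store interfaces are illustrative assumptions.
    def kb_search(query: str, need_citations: bool, emm, vector_store) -> dict:
        hit = emm.recall(query)                 # fast associative lookup, no sources
        if hit is not None and not need_citations:
            return {"answer": hit, "sources": []}
        docs = vector_store.search(query, k=5)  # slower, but returns exact passages
        return {"answer": hit or summarise(docs), "sources": docs}

    def summarise(docs) -> str:
        return " ".join(d["text"] for d in docs)[:500]  # placeholder summariser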

Honest scope: where it is fast, where it is not.

On our reference cluster, chat sustains 793 requests per second at 32 concurrent connections, with median latency of 41 milliseconds. That is good. Knowledge-base search at four concurrent connections still hits a ~7-second p50 latency cliff. That is not good, and we are fixing it. The numbers come from our 2026-05 baseline; we publish them so you know what to expect.
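Those chat figures are internally consistent: by Little's law, 32 in-flight requests at roughly 41 ms each imply about 780 requests per second, close to the measured 793 (treating the median as a rough stand-in for the mean).

    # Little's law sanity check on the published chat numbers.
    concurrency = 32                  # concurrent connections
    latency_s = 0.041                 # median latency in seconds
    print(concurrency / latency_s)    # ≈ 780 req/s, close to the measured 793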


Hardware

What it actually runs on.

Our reference cluster is intentionally modest. The numbers above come from this hardware.

1 · Inference-tier GPU (RTX 4070 Ti, 12 GB) for LLMs
1 · Router-tier GPU (RTX 2080, 8 GB) for routing + small models
5 · Worker nodes total, including controller, edge, data
Pi 4 · The smallest target; 8 GB RAM is enough for kernel + light models