Eldric is a small set of co-operating processes. Each one does one job. They can all run on one machine for a developer trial, or spread across many machines for a multi-tenant production cluster.
A request from a client lands at the Edge (TLS + auth + rate-limit). The Edge hands it to a Router. The Router asks the Controller for the current topology, classifies the request (chat? RAG? voice? science look-up?), and picks the right worker. The worker does the work and streams the answer back along the same path. The data path is intentionally short.
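That hop sequence, reduced to a runnable sketch. Everything below (the function names, the topology shape, the intent labels) is illustrative stand-in code, not Eldric's actual API.

```python
# Minimal sketch of the Edge -> Router -> Worker path; all names are hypothetical.
from typing import Iterator

def classify_intent(prompt: str) -> str:
    # Toy classifier; the real Router uses rules plus an optional LLM-based mode.
    return "rag" if "search" in prompt.lower() else "chat"

def pick_worker(topology: dict[str, list[str]], intent: str) -> str:
    # Take any worker registered for this intent; the real Router also weighs load and theme.
    return topology[intent][0]

def handle_request(prompt: str, topology: dict[str, list[str]]) -> Iterator[str]:
    # By this point the Edge has already terminated TLS, validated the API key,
    # and applied rate limits.
    intent = classify_intent(prompt)
    worker = pick_worker(topology, intent)
    # The real worker streams tokens back along the same path; one fake chunk here.
    yield f"[{worker}] answer to: {prompt!r}"

if __name__ == "__main__":
    topo = {"chat": ["inference-1"], "rag": ["data-1"]}
    for chunk in handle_request("hello", topo):
        print(chunk)
```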
The only process exposed to the public network. Terminates TLS, validates the API key, enforces rate limits, and forwards to a Router. Has no model state of its own. Also serves the built-in chat shell at /chat.
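A sketch of the admission step, assuming a plain token bucket per API key. The class and function names are invented for the example; the real Edge may use a different limiter.

```python
# Hypothetical per-key rate limiting at the Edge: validate the key, then apply a token bucket.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def admit(api_key: str, valid_keys: set[str]) -> bool:
    # Reject unknown keys before they cost anything, then enforce the per-key bucket.
    if api_key not in valid_keys:
        return False
    bucket = buckets.setdefault(api_key, TokenBucket(rate=5.0, burst=10))
    return bucket.allow()
```

Requests that pass `admit()` are forwarded to a Router; everything else is rejected before it touches a worker.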
Keeps the cluster topology in one place. Workers register here and heartbeat every thirty seconds. The controller owns the license file, the audit ledger, the backup orchestration, the rolling-upgrade coordinator, and the PKI for internal certificates.
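The register-then-heartbeat contract in miniature, assuming an in-process Controller object. Over the network this would be two RPC calls; every name here is invented for the example.

```python
# Hypothetical sketch of worker registration and the thirty-second heartbeat.
import threading
import time

HEARTBEAT_INTERVAL = 30  # seconds, as described above

class Controller:
    def __init__(self):
        self.workers: dict[str, float] = {}  # worker id -> last heartbeat time

    def register(self, worker_id: str) -> None:
        self.workers[worker_id] = time.monotonic()

    def heartbeat(self, worker_id: str) -> None:
        self.workers[worker_id] = time.monotonic()

    def alive(self, worker_id: str, grace: float = 2 * HEARTBEAT_INTERVAL) -> bool:
        # A worker that misses two beats is treated as gone from the topology.
        last = self.workers.get(worker_id)
        return last is not None and time.monotonic() - last < grace

def run_worker(controller: Controller, worker_id: str, stop: threading.Event) -> None:
    controller.register(worker_id)
    while not stop.wait(HEARTBEAT_INTERVAL):  # beat every 30 s until asked to stop
        controller.heartbeat(worker_id)
```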
Decides which worker handles which request. Picks based on intent (a chat? a RAG search? a voice call?), load (which worker is least busy?), and theme (a medical question routes to a medically-tuned model). Has eight load-balancing strategies and an optional LLM-based decision mode.
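One plausible selection pass, assuming a least-loaded strategy (just one of the eight): filter by intent, prefer a theme match, then take the least-loaded candidate. The `Worker` fields and the scoring are invented for the example, not Eldric's exact logic.

```python
# Hypothetical intent + theme + load selection.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    intents: set[str]   # e.g. {"chat", "rag"}
    themes: set[str]    # e.g. {"medical"}
    load: float         # 0.0 = idle, 1.0 = saturated

def pick_worker(workers: list[Worker], intent: str, theme: str | None = None) -> Worker:
    candidates = [w for w in workers if intent in w.intents]
    if theme:
        themed = [w for w in candidates if theme in w.themes]
        candidates = themed or candidates        # fall back if no themed worker exists
    return min(candidates, key=lambda w: w.load)  # least-loaded wins

workers = [
    Worker("general-1", {"chat", "rag"}, set(), load=0.6),
    Worker("med-1", {"chat"}, {"medical"}, load=0.3),
]
print(pick_worker(workers, "chat", theme="medical").name)  # -> med-1
```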
One per function. Inference workers run models (Ollama, vLLM, llama.cpp, or a cloud API). Data workers store files, vectors, and matrix memory. Agent workers run the iterative reasoning loops. Media workers do speech-to-text, text-to-speech, and video. Comm workers carry email, SMS, WhatsApp, Signal, Teams, and VoIP. Science workers proxy to the 140 external scientific APIs. Training workers fine-tune models.
A native inference worker that loads GGUF and xLSTM models directly through embedded llama.cpp. No Ollama dependency. Use it for the smallest deployments and for air-gapped sites.
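The native worker embeds llama.cpp directly, so its internals are not shown here; as a rough analogy for what loading a GGUF model with no Ollama in between looks like, this uses the llama-cpp-python bindings (`pip install llama-cpp-python`). The model path and settings are placeholders.

```python
# Analogy only: direct GGUF loading via llama-cpp-python, not Eldric's embedded worker.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,      # context window
    n_threads=4,     # modest enough for a small, air-gapped box
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise the Eldric data path in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```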
The 5.0 kernel is the same on a Raspberry Pi 4, a developer workstation, a rack-mounted server, and across a multi-node cluster. What changes is which modules you activate per node. A small box does not get a stripped-down product; it gets the same product with fewer modules switched on.
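A hypothetical illustration of that idea: one kernel, different module subsets per node profile. The module and profile names below are invented; the real configuration keys will differ.

```python
# "Same kernel, fewer modules switched on" as a toy profile table.
KERNEL_MODULES = {
    "edge", "router", "controller", "inference", "data",
    "agent", "media", "comm", "science", "training",
}

PROFILES = {
    "raspberry-pi-4": {"edge", "router", "inference"},            # whole stack, one small box
    "workstation":    {"edge", "router", "inference", "data"},
    "cluster-node":   {"controller", "inference", "data", "agent"},
}

def active_modules(profile: str) -> set[str]:
    # Every profile activates a subset of the one kernel; nothing is stripped out.
    modules = PROFILES[profile]
    assert modules <= KERNEL_MODULES
    return modules
```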
Edge → Router → Worker. Three hops. Streaming responses pass through with no buffering. Knowledge-base search hits the EMM (compressed, associative memory) first and falls back to the vector store only when exact source citations are needed; for pure chat use cases the vector store can be dropped entirely. There is no hidden middleware that resells your data.
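The fallback rule in miniature. The EMM and vector-store interfaces below are invented stand-ins, assuming the only decision input is whether the caller needs exact citations.

```python
# Hypothetical EMM-first lookup with vector-store fallback.
class EMM:
    """Compressed, associative memory: fast, but no exact source citations."""
    def __init__(self, facts: dict[str, str]):
        self.facts = facts
    def lookup(self, query: str) -> str | None:
        return self.facts.get(query)

class VectorStore:
    """Slower path that can return exact sources; stubbed out for the sketch."""
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        return [{"text": f"passage for {query!r}", "source": "doc-42", "rank": i}
                for i in range(top_k)]

def kb_search(query: str, emm: EMM, store: VectorStore, need_citations: bool = False):
    hit = emm.lookup(query)
    if hit is not None and not need_citations:
        return hit              # pure-chat path: the vector store is never touched
    return store.search(query)  # citations needed or EMM miss: fall back
```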
On our reference cluster, chat sustains 793 requests per second at 32 concurrent connections, with median latency of 41 milliseconds. That is good. Knowledge-base search at four concurrent connections still hits a ~7-second p50 latency cliff. That is not good, and we are fixing it. The numbers come from our 2026-05 baseline; we publish them so you know what to expect.
Our reference cluster is intentionally modest. The numbers above come from this hardware.