Cluster admin guide

Running an Eldric cluster.
A walkthrough.

This page is for the person whose job is keeping an Eldric installation healthy day-to-day — install, users, tenants, monitoring, backup, upgrade. It walks the standard ops surface in order; for the deep references (every endpoint, every flag) follow the links into the API reference and the per-feature pages.


Topology

What a cluster looks like.

An Eldric cluster has one controller, one or more routers, one or more inference workers, one or more data workers, and an optional ring of specialised workers (agent / media / comm / science / training / xLSTM / IoT). The edge gateway is the public entry point. All daemons run as systemd units on each host.
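The role counts above can be written down as a small sanity check. This is a minimal sketch, not part of the platform: the role labels are taken from the prose, while `check_topology` and the host names are hypothetical. The source does not say how many edge gateways a cluster runs, so the sketch recognises that role without counting it.

```python
from collections import Counter

# Role labels come from the paragraph above; they are not Eldric API identifiers.
EXACTLY_ONE = {"controller"}
AT_LEAST_ONE = {"router", "inference", "data"}
UNCOUNTED = {"edge", "agent", "media", "comm", "science", "training", "xlstm", "iot"}


def check_topology(hosts):
    """Return a list of problems; an empty list means the layout is plausible."""
    counts = Counter(role for roles in hosts.values() for role in roles)
    problems = []
    for role in sorted(EXACTLY_ONE):
        if counts[role] != 1:
            problems.append(f"expected exactly one {role}, found {counts[role]}")
    for role in sorted(AT_LEAST_ONE):
        if counts[role] < 1:
            problems.append(f"expected at least one {role}, found 0")
    for role in sorted(set(counts) - EXACTLY_ONE - AT_LEAST_ONE - UNCOUNTED):
        problems.append(f"unknown role: {role}")
    return problems


# Hypothetical three-host layout.
layout = {
    "mgmt-host": ["controller", "edge", "router"],
    "gpu-1": ["inference", "xlstm"],
    "store-1": ["data"],
}
print(check_topology(layout))
```

An empty list means the layout satisfies the stated minimums; anything else names the role that is missing or duplicated.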

For a small evaluation cluster, a single host runs everything via the eldric-aios meta-package. For production, you split across hosts — typically the GPU-equipped boxes run inference + xLSTM, the storage-heavy boxes run data workers, and a small management host runs the controller + edge. The install guide covers the single-node path; the multi-node path is below.


Day 1 — install

Standing the cluster up.

Single-node evaluation

One host, one meta-package:

curl -fsSL https://repo.eldric.ai/install.sh | sudo bash
sudo dnf install eldric-aios
sudo systemctl enable --now eldric-aios

Within 30 seconds, the chat shell is available at https://<host>/chat. The first user to sign up becomes admin. See first run for the post-install setup.

Multi-node production

Use the same install flow on every host, selecting the packages by role:

# Management host
sudo dnf install eldric-aios-controller eldric-aios-edge

# Inference hosts (GPU-equipped)
sudo dnf install eldric-aios-worker eldric-aios-inferenced
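# Tell the daemons where the controller is. systemctl set-environment changes only the
# running systemd manager's environment; it does not persist across a reboot.
sudo systemctl set-environment ELDRIC_CONTROLLER=https://mgmt-host:8880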

# Data hosts
sudo dnf install eldric-aios-data

# Optional specialised
sudo dnf install eldric-aios-{agent,media,comm,science,training,xlstmd,iiotd}

Each daemon registers itself with the controller on first start. Running systemctl status 'eldric-*' on each host (quoted, so the shell passes the pattern to systemctl unexpanded) shows the state of every Eldric unit. The cluster topology page in the chat shell (Admin Console → Cluster) shows the registered workers in real time.
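For more than a handful of hosts it can help to keep the role-to-package mapping in one place. The sketch below assumes nothing beyond the package names in the listing above; the role keys, the `install_command` helper, and the host names other than mgmt-host are hypothetical.

```python
# Package names are copied from the install listing; the role keys are
# labels invented for this sketch.
ROLE_PACKAGES = {
    "management": ["eldric-aios-controller", "eldric-aios-edge"],
    "inference": ["eldric-aios-worker", "eldric-aios-inferenced"],
    "data": ["eldric-aios-data"],
}


def install_command(roles):
    """Build one dnf command covering every role assigned to a host."""
    packages = []
    for role in roles:
        for pkg in ROLE_PACKAGES[role]:  # an unknown role raises KeyError on purpose
            if pkg not in packages:
                packages.append(pkg)
    return "sudo dnf install " + " ".join(packages)


# Hypothetical inventory: host name -> roles.
inventory = {
    "mgmt-host": ["management"],
    "gpu-1": ["inference"],
    "store-1": ["data"],
}
for host, roles in inventory.items():
    print(f"{host}: {install_command(roles)}")
```

The output is one install line per host, which can be reviewed before anything is run.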


Day 2 — operations

The recurring admin surface.

Users & tenants

Admin Console → Users to add, suspend or remove users. Roles are Viewer / Developer / Admin / SuperAdmin (the last for cross-tenant operations only). Admin Console → Tenants to add new tenants, one per department / study / project / customer. Per-tenant scope is enforced at the gateway; cross-tenant access is denied unconditionally.

Walkthrough — onboarding a new department: (1) create the tenant (Tenants → New) with a short slug; (2) assign a per-tenant storage quota; (3) add the department head as Admin of that tenant; (4) the Admin invites their users via the Admin Console of their own tenant. The platform-level SuperAdmin steps out at this point — day-to-day administration lives inside the tenant.
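The division of labour in that walkthrough can be pictured as a scope check. This is a sketch of one reading of the text, not the gateway's actual logic: the `Role`, `User` and `may_administer` names, the tenant slugs, and the rule that only SuperAdmin acts across tenants are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    VIEWER = "Viewer"
    DEVELOPER = "Developer"
    ADMIN = "Admin"
    SUPERADMIN = "SuperAdmin"


@dataclass(frozen=True)
class User:
    name: str
    tenant: str  # slug of the tenant the user belongs to
    role: Role


def may_administer(user, target_tenant):
    """Admins manage their own tenant; only SuperAdmin acts across tenants."""
    if user.role is Role.SUPERADMIN:
        return True
    return user.role is Role.ADMIN and user.tenant == target_tenant


# Hypothetical department head for a tenant with the slug "radiology".
head = User("dept-head", "radiology", Role.ADMIN)
print(may_administer(head, "radiology"))   # True
print(may_administer(head, "cardiology"))  # False
```

Once the department head holds Admin inside the tenant, every later invitation passes the first branch of the check without involving the platform-level SuperAdmin.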

Knowledge bases

Admin Console → KBs to provision per-tenant knowledge bases. Each KB has its own embedding model + vector storage + (optional) matrix-memory layer. The compressed-memory preview lives here — see advanced retrieval for the opt-in path.

Walkthrough — adding documents: (1) KBs → New KB → pick embedding model and the optional matrix-memory tier; (2) KB → Upload → drop PDF / DOCX / Markdown / HTML / plain text (or pull from a Data Worker mount); (3) the embedding pipeline runs in the background — track in KBs → Status; (4) chat against the KB by selecting it in the chat shell's source picker, or query directly via the API. Re-embedding after a model change rebuilds the entire KB in place; no manual rollover needed.

Model registry

Admin Console → Models to manage which models are visible per tenant — show all, restrict to a curated list, hide external APIs entirely. The backend badges (Ollama, OpenAI, Inferenced, vLLM, llama.cpp, and so on — see model providers) are auto-derived from each model's source. Pinning a model as the per-tenant default makes it the entry point for new conversations.
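The three visibility modes and the pinned default combine roughly as follows. Everything here is illustrative: the `TenantPolicy` fields, the model names, and the choice of which backends count as external are assumptions, not the registry's schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: which backends count as "external APIs".
EXTERNAL_BACKENDS = {"OpenAI"}


@dataclass(frozen=True)
class Model:
    name: str
    backend: str  # badge derived from the model's source, e.g. "Ollama"


@dataclass
class TenantPolicy:
    curated: Optional[frozenset] = None  # None means "show all"
    hide_external: bool = False
    default_model: Optional[str] = None  # pinned entry point for new conversations


def visible_models(models, policy):
    """Apply the external-API filter first, then the curated list."""
    visible = []
    for model in models:
        if policy.hide_external and model.backend in EXTERNAL_BACKENDS:
            continue
        if policy.curated is not None and model.name not in policy.curated:
            continue
        visible.append(model)
    return visible


def entry_model(models, policy):
    """Return the pinned default only if the tenant can actually see it."""
    names = {m.name for m in visible_models(models, policy)}
    return policy.default_model if policy.default_model in names else None


# Hypothetical catalogue.
catalog = [Model("local-small", "Ollama"), Model("hosted-large", "OpenAI"), Model("local-fast", "vLLM")]
policy = TenantPolicy(hide_external=True, default_model="local-fast")
print([m.name for m in visible_models(catalog, policy)], entry_model(catalog, policy))
```

Checking the pinned default against the visible set avoids a tenant whose entry point is a model its own policy hides.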

Licensing

Admin Console → License to drop in your signed license file. The controller verifies the Ed25519 signature on the file and lifts limits accordingly. Mid-license updates are applied hot, with no restart. License email: license@core.at.

Logs & audit

journalctl -u eldric-aios shows the unified meta-unit; use the per-daemon units, such as journalctl -u eldric-aios-controller, for a single service. The audit ledger at /var/lib/eldric/audit/ hash-chains every admin-side action and AI-assisted decision, giving a defensible record for compliance reviews. The ledger is append-only and tamper-evident: an admin reading the ledger cannot edit prior entries, even via direct file access, without breaking the hash chain and making the change detectable. Admin Console → Audit exports a slice of the ledger as signed JSON for compliance handoff.
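To see why a hash chain makes edits detectable, consider this minimal sketch. It illustrates the general technique only; the entry fields, the SHA-256 choice, the all-zero genesis value and the function names are assumptions, and the real on-disk format under /var/lib/eldric/audit/ may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # stand-in "previous hash" for the first entry (assumption)


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's canonical JSON."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger, payload):
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(ledger):
    """Return the index of the first broken entry, or -1 if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(ledger):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return index
        prev = entry["hash"]
    return -1


ledger = []
append(ledger, {"actor": "admin", "action": "suspend_user"})
append(ledger, {"actor": "admin", "action": "create_tenant"})
print(verify(ledger))  # -1: chain intact

ledger[0]["payload"]["action"] = "noop"  # simulate an edit made directly in the file
print(verify(ledger))  # 0: the first entry no longer matches its hash
```

Rewriting an old entry convincingly would mean recomputing every later hash as well, which is what a signed export or an externally stored head hash is meant to expose.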


Monitoring

What to watch.

Recommended alerts to wire into your existing stack:

The Admin Console → Telemetry page suggests sensible defaults for each. Tune to your traffic shape; alerts that never fire are noise to your on-call.


Backup

What to back up.

Two backup paths cover the cluster state:

For offsite / disaster-recovery copies on 5.0, mount your offsite storage on the data worker and point the snapshot system at it. 5.1 adds the offsite-destination automation.


Upgrades

From one alpha (or one minor release) to the next.

The controller runs a rolling-update orchestrator that walks every node in turn: drain → install → restart → verify, then move on. From the Admin Console → Updates, pick the target version and start; the orchestrator handles the sequence and reports per-node status.
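The drain → install → restart → verify sequence amounts to a nested loop over nodes and steps. This is a sketch of the idea rather than the orchestrator's code: `rolling_update`, the `run_step` callback, the node names and the stop-on-first-failure behaviour are assumptions.

```python
STEPS = ("drain", "install", "restart", "verify")


def rolling_update(nodes, run_step):
    """Update nodes one at a time; halt at the first failed step.

    run_step(node, step) is a caller-supplied callable returning True on success.
    Halting early (an assumption of this sketch) leaves later nodes untouched
    and still on the old version.
    """
    status = {}
    for node in nodes:
        for step in STEPS:
            if not run_step(node, step):
                status[node] = f"failed:{step}"
                return status
        status[node] = "ok"
    return status


# Stand-in for real work: everything succeeds except verification on gpu-2.
def fake_step(node, step):
    return not (node == "gpu-2" and step == "verify")


print(rolling_update(["mgmt-host", "gpu-1", "gpu-2", "store-1"], fake_step))
```

In the example run, store-1 does not appear in the result because the walk stops at gpu-2, which mirrors the per-node status view: updated, failed, or not yet reached.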

For single-node installs the standard sudo dnf update eldric-aios works directly. For air-gapped clusters, mirror repo.eldric.ai/5.0/ locally and point dnf at the mirror.

Rollback automation arrives with 5.1 (§70). On 5.0, rollback is manual: pin the previous version and re-run the orchestrator.


Stress testing

Confirming readiness before a soak.

The platform ships a stress-test harness — parallel-user × request-count load against a cluster host with pass/fail thresholds for p99 latency and error budget. Run it before the soak window when you're commissioning a cluster, and re-run after a meaningful capacity change. The results compare against the published demo-cluster baseline.
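The shape of such a run (parallel users, a fixed request count each, a p99 latency threshold and an error budget) can be sketched in a few lines. This is not the shipped harness: the function names, the nearest-rank percentile and the threshold values are assumptions, and `send` stands in for whatever request the real tool issues.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def p99(samples):
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def run_load(send, users, requests_per_user):
    """Run users x requests_per_user calls; return (latencies, error_count)."""
    def one_user(user_index):
        latencies, errors = [], 0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                send()
            except Exception:
                errors += 1  # any exception counts against the error budget
            latencies.append(time.perf_counter() - start)
        return latencies, errors

    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(one_user, range(users)))
    all_latencies = [value for latencies, _ in results for value in latencies]
    return all_latencies, sum(errors for _, errors in results)


def verdict(latencies, errors, p99_limit_s, error_budget):
    """Pass only if both the latency and the error-rate thresholds hold."""
    if not latencies:
        return False
    return p99(latencies) <= p99_limit_s and errors / len(latencies) <= error_budget


# Stand-in request that sleeps briefly; thresholds are illustrative.
latencies, errors = run_load(lambda: time.sleep(0.001), users=8, requests_per_user=25)
print(len(latencies), errors, verdict(latencies, errors, p99_limit_s=0.5, error_budget=0.01))
```

Keeping the verdict separate from the load generation makes it easy to re-evaluate a recorded run against tighter thresholds after a capacity change.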


When something breaks

Troubleshooting paths.


Next.

For the developer-side view, see for developers and the API reference. For a deeper look at the platform, see how it works. For GA preparation, see road to 5.0 GA. Questions: office@eldric.ai.