Cluster admin guide

Running an Eldric cluster.
A walkthrough.

This page is for the person whose job is keeping an Eldric installation healthy day-to-day — install, users, tenants, monitoring, backup, upgrade. It walks the standard ops surface in order; for the deep references (every endpoint, every flag) follow the links into the API reference and the per-feature pages.


Topology

What a cluster looks like.

An Eldric cluster has one controller, one or more routers, one or more inference workers, one or more data workers, and an optional ring of specialised workers (agent / media / comm / science / training / xLSTM / IoT). The edge gateway is the public entry point. All daemons run as systemd units on each host.

For a small evaluation cluster, a single host runs everything via the eldric-aios meta-package. For production, you split across hosts — typically the GPU-equipped boxes run inference + xLSTM, the storage-heavy boxes run data workers, and a small management host runs the controller + edge. The install guide covers the single-node path; the multi-node path is below.


Day 1 — install

Standing the cluster up.

Single-node evaluation

One host, one meta-package. The recommended single-command path:

curl -fsSL https://repo.eldric.ai/install-eldric.sh | sudo bash

That bootstrap chains the repository setup, dnf install eldric-aios, systemctl enable --now eldric-aios, and the post-install eldric setup health probe + summary. To activate a license file in the same run, append -s -- --license-file PATH --admin-email EMAIL after bash, so the arguments are passed through to the installer script.

Prefer the steps separately:

curl -fsSL https://repo.eldric.ai/install.sh | sudo bash
sudo dnf install eldric-aios
sudo systemctl enable --now eldric-aios
sudo eldric setup

Within 30 seconds the chat shell is at https://<host>/chat. (On a single node without a dedicated edge gateway, Eldric serves it on the local admin port until you add an edge.) First signup becomes admin. See first run for what to do next.
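
If you script the install, it can help to wait for the chat shell rather than assume the 30-second window. The helper below is a minimal sketch, not part of Eldric: wait_until is a hypothetical function name, and the host in the usage comment is a placeholder you would replace with your own.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout.
wait_until() {
    _timeout=$1
    shift
    _deadline=$(( $(date +%s) + _timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$_deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Usage (placeholder host; curl -f makes HTTP errors count as failure):
#   wait_until 30 curl -fsS -o /dev/null "https://eldric.example.internal/chat" \
#       && echo "chat shell is up"
```

On a single node without an edge gateway, point the probe at the local admin port instead, as described above.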

Multi-node production

Eldric 5.0 ships as one meta-package per host (eldric-aios); GPU hosts add the CUDA RPM (eldric-aios-cuda). The same install command runs on every node; what differs is the role assignment, set via the environment file at /etc/eldric/eldric-aios.env.

# Every host — same one-liner
curl -fsSL https://repo.eldric.ai/install-eldric.sh | sudo bash

# GPU hosts add the CUDA package
sudo dnf install eldric-aios-cuda

# On each non-controller host, point at the controller and pin the role
echo "ELDRIC_AIOS_CONTROLLER_URL=https://mgmt-host:8880" | \
    sudo tee -a /etc/eldric/eldric-aios.env
# Role is one of: worker / controller / data / edge / agent / media / comm / science / training / iot
echo "ELDRIC_AIOS_ROLE=worker" | sudo tee -a /etc/eldric/eldric-aios.env
sudo systemctl restart eldric-aios
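
Because tee -a appends, re-running the provisioning commands leaves duplicate keys in the environment file. A possible alternative is a small set-or-replace helper. This is a sketch under assumptions: set_env_key is a hypothetical name, it assumes the file holds plain KEY=VALUE lines, and it must run as a user who can write the file (root for /etc/eldric).

```shell
# set_env_key FILE KEY VALUE
# Drops any existing KEY=... line from FILE and appends KEY=VALUE,
# so repeated runs leave exactly one entry per key.
set_env_key() {
    _file=$1
    _key=$2
    _value=$3
    _tmp=$(mktemp) || return 1
    # Keep every line whose text before the first "=" is not this key.
    if [ -f "$_file" ]; then
        awk -F= -v k="$_key" '$1 != k' "$_file" > "$_tmp"
    fi
    printf '%s=%s\n' "$_key" "$_value" >> "$_tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$_tmp" > "$_file" && rm -f "$_tmp"
}

# Usage (as root), with the keys shown above:
#   set_env_key /etc/eldric/eldric-aios.env ELDRIC_AIOS_CONTROLLER_URL https://mgmt-host:8880
#   set_env_key /etc/eldric/eldric-aios.env ELDRIC_AIOS_ROLE data
```

The updated key moves to the end of the file; comments and other keys are left untouched.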

Each daemon registers itself with the controller on first start. systemctl status eldric-aios on each host confirms the lifecycle; the unified meta-unit hosts every module on that node. The cluster topology page in the chat shell (Admin Console → Cluster) shows the registered nodes in real time. Role-pinning matters — leaving a bare node on the default role=all can trigger health watchdog crash-loops on hardware that doesn't fit the full stack.
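
To confirm the unit state on every host from one terminal, something like the following could work. It is a sketch, not an Eldric tool: check_nodes is a hypothetical name, the host names in the usage comment are placeholders, and it assumes key-based SSH access to each node.

```shell
# check_nodes HOST...
# Prints "host state" for each node and returns non-zero if any node
# does not report the eldric-aios unit as active.
check_nodes() {
    _rc=0
    for _host in "$@"; do
        # systemctl is-active prints the unit state (active, inactive, failed, ...)
        # and exits non-zero for anything other than active.
        _state=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$_host" \
            systemctl is-active eldric-aios 2>/dev/null) || true
        [ -n "$_state" ] || _state=unreachable
        printf '%s %s\n' "$_host" "$_state"
        [ "$_state" = "active" ] || _rc=1
    done
    return "$_rc"
}

# Usage (placeholder host names):
#   check_nodes mgmt-host gpu-1 gpu-2 data-1
```

The Admin Console → Cluster page remains the authoritative view of which nodes have registered; this only checks that the systemd unit is running.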


Day 2 — operations

The recurring admin surface.

Users & tenants

Admin Console → Users to add, suspend or remove users. Roles are Viewer / Developer / Admin / SuperAdmin (the latter for cross-tenant operations only). Admin Console → Tenants to add new tenants — one per department / study / project / customer. Per-tenant scope is enforced at the gateway; cross-tenant access is denied unconditionally.

Walkthrough — onboarding a new department: (1) create the tenant (Tenants → New) with a short slug; (2) assign a per-tenant storage quota; (3) add the department head as Admin of that tenant; (4) the Admin invites their users via the Admin Console of their own tenant. The platform-level SuperAdmin steps out at this point — day-to-day administration lives inside the tenant.

Knowledge bases

Admin Console → KBs to provision per-tenant knowledge bases. Each KB has its own embedding model + vector storage + (optional) matrix-memory layer. The compressed-memory preview lives here — see advanced retrieval for the opt-in path.

Walkthrough — adding documents: (1) KBs → New KB → pick embedding model and the optional matrix-memory tier; (2) KB → Upload → drop PDF / DOCX / Markdown / HTML / plain text (or pull from a Data Worker mount); (3) the embedding pipeline runs in the background — track in KBs → Status; (4) chat against the KB by selecting it in the chat shell's source picker, or query directly via the API. Re-embedding after a model change rebuilds the entire KB in place; no manual rollover needed.

Model registry

Admin Console → Models to manage which models are visible per tenant — show all, restrict to a curated list, hide external APIs entirely. The backend badges (Ollama, OpenAI, Inferenced, vLLM, llama.cpp, and so on — see model providers) are auto-derived from each model's source. Pinning a model as the per-tenant default makes it the entry point for new conversations.

Licensing

Admin Console → License to drop in your signed license file. The controller verifies the Ed25519 signature on the file and lifts limits accordingly. License updates mid-term apply hot, with no restart. License email: license@core.at.

Logs & audit

journalctl -u eldric-aios -f tails the unified meta-unit (the 5.0 daemon hosts every module that node's role enabled). The audit ledger at /var/lib/eldric/audit/ hash-chains every admin-side action and AI-assisted decision, giving a defensible record for compliance reviews. The ledger is append-only and tamper-evident; an admin reading the ledger cannot edit prior entries, even via direct file access. Admin Console → Audit exports a slice of the ledger as signed JSON for compliance handoff.
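
To see why a hash chain makes edits evident, here is a generic illustration. It does not reflect the actual on-disk format of the Eldric ledger, which this page does not describe; the "HASH PAYLOAD" line layout, the genesis seed, and all function names are assumptions made for the example. It relies on sha256sum from coreutils.

```shell
# chain_hash PREV_HASH PAYLOAD: prints sha256 over "PREV_HASH PAYLOAD".
chain_hash() {
    printf '%s %s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# ledger_append FILE PAYLOAD: appends "HASH PAYLOAD", chained to the previous entry.
ledger_append() {
    _prev=$(tail -n 1 "$1" 2>/dev/null | cut -d' ' -f1)
    [ -n "$_prev" ] || _prev=genesis
    printf '%s %s\n' "$(chain_hash "$_prev" "$2")" "$2" >> "$1"
}

# ledger_verify FILE: recomputes every hash; fails at the first mismatch.
ledger_verify() {
    _prev=genesis
    while IFS=' ' read -r _hash _payload; do
        [ "$(chain_hash "$_prev" "$_payload")" = "$_hash" ] || return 1
        _prev=$_hash
    done < "$1"
}

# Usage:
#   ledger_append /tmp/demo.ledger "user alice suspended"
#   ledger_append /tmp/demo.ledger "tenant radiology created"
#   ledger_verify /tmp/demo.ledger && echo "chain intact"
```

Each hash covers the previous hash, so changing one entry invalidates that entry and every one after it. In any scheme of this kind, detection depends on the latest hash being recorded somewhere the editor cannot rewrite.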


Monitoring

What to watch.

Recommended alerts to wire into your existing stack:

The Admin Console → Telemetry page suggests sensible defaults for each. Tune to your traffic shape; alerts that never fire are noise to your on-call.


Backup

What to back up.

Two backup paths cover the cluster state:

For offsite / disaster-recovery copies, mount your offsite storage on the data worker and point the snapshot system at it; that is the supported path on 5.0. An upcoming 5.0.x patch adds the offsite-destination automation.


Upgrades

From one patch release to the next.

The controller runs a rolling-update orchestrator that walks every node in turn: drain → install → restart → verify, then move on. From the Admin Console → Updates, pick the target version and start; the orchestrator handles the sequence and reports per-node status.

For single-node installs the standard sudo dnf update eldric-aios works directly. For air-gapped clusters, mirror repo.eldric.ai/5.0/ locally and point dnf at the mirror.
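
For the air-gapped case, pointing dnf at the mirror usually means dropping a repo definition on each node. The sketch below writes one; the repo id eldric-mirror, the file name, and the mirror URL in the usage comment are all hypothetical. Signature settings (gpgcheck, gpgkey) are deliberately left out: copy them from the repo file the installer created, and disable that original repo so dnf does not try to reach repo.eldric.ai.

```shell
# write_mirror_repo MIRROR_BASEURL [REPO_DIR]
# Writes a dnf repo definition pointing at a local mirror of repo.eldric.ai/5.0/.
# REPO_DIR defaults to /etc/yum.repos.d (writing there requires root).
write_mirror_repo() {
    _baseurl=$1
    _dir=${2:-/etc/yum.repos.d}
    cat > "$_dir/eldric-mirror.repo" <<EOF
[eldric-mirror]
name=Eldric 5.0 (local mirror)
baseurl=$_baseurl
enabled=1
EOF
}

# Usage (placeholder mirror URL):
#   write_mirror_repo "http://mirror.example.internal/eldric/5.0/"
#   sudo dnf update eldric-aios
```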

Rollback automation arrives in an upcoming 5.0.x patch. On 5.0, rollback is manual: pin the previous version and re-run the orchestrator.


Stress testing

Confirming readiness before a soak.

The platform ships a stress-test harness — parallel-user × request-count load against a cluster host with pass/fail thresholds for p99 latency and error budget. Run it before the soak window when you're commissioning a cluster, and re-run after a meaningful capacity change. The results compare against the published demo-cluster baseline.
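
The pass/fail logic behind such thresholds can be sketched in a few lines. This is not the shipped harness, whose interface this page does not document: evaluate_run, the results-file layout ("<http_status> <latency_ms>" per request), and the rule that any non-2xx status counts as an error are assumptions for illustration. The p99 uses the nearest-rank method.

```shell
# evaluate_run RESULTS_FILE P99_LIMIT_MS ERROR_BUDGET_PCT
# Prints "p99_ms=<n> error_pct=<n.nn>" and returns 0 on pass, 1 on fail.
evaluate_run() {
    sort -k2,2n "$1" | awk -v limit="$2" -v budget="$3" '
        {
            lat[NR] = $2 + 0                      # latencies arrive sorted ascending
            if ($1 < 200 || $1 > 299) errors++    # assumption: non-2xx is an error
        }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 99 + 99) / 100)      # nearest rank: ceil(0.99 * N)
            err_pct = 100 * errors / NR
            printf "p99_ms=%d error_pct=%.2f\n", lat[rank], err_pct
            exit !(lat[rank] <= limit + 0 && err_pct <= budget + 0)
        }'
}

# Usage: fail the run if p99 exceeds 1500 ms or more than 0.5% of requests error.
#   evaluate_run results.txt 1500 0.5 || echo "not ready for soak"
```

Checking both thresholds together matters: a run can show a good p99 simply because failed requests returned quickly.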


When something breaks

Troubleshooting paths.


Next.

For the developer-side view, see for developers and the API reference. For a deeper look at the platform, see how it works. Questions: office@eldric.ai.