Cluster administrator guide

Topology

What a cluster looks like.

An Eldric cluster has one controller, one or more routers, one or more inference workers, one or more data workers, and an optional ring of specialised workers (agent / media / comm / science / training / xLSTM / IoT). The edge gateway is the public entry point. All daemons run as systemd units on each host.

For a small evaluation cluster, a single host runs everything via the eldric-aios meta-package. For production, you split across hosts — typically the GPU-equipped boxes run inference + xLSTM, the storage-heavy boxes run data workers, and a small management host runs the controller + edge. The install guide covers the single-node path; the multi-node path is below.

Day 1 — install

Standing the cluster up.

Single-node evaluation

One host, one meta-package:

curl -fsSL https://repo.eldric.ai/install.sh | sudo bash
sudo dnf install eldric-aios
sudo systemctl enable --now eldric-aios

Within 30 seconds, the chat shell is at https://<host>/chat. First signup becomes admin. See first run for the post-install setup.
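
The readiness window above can be checked from a script instead of a browser. The following is a minimal sketch, assuming curl is installed and the host already presents a certificate that curl trusts; the helper names wait_until and chat_ready are illustrative and not part of Eldric.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or a deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical probe: succeeds once https://<host>/chat answers without an HTTP error.
chat_ready() {
  curl -fsS -o /dev/null --max-time 5 "https://$1/chat"
}

# Example (not run here): wait up to 60 s for the chat shell on this host.
# wait_until 60 2 chat_ready "$(hostname -f)" && echo "chat shell is up"
```

The timeout in the example is deliberately longer than 30 seconds so that a slow first start is not reported as a failure.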

Multi-node production

Same install path on every host, but with role-specific packages instead of the meta-package:

# Management host
sudo dnf install eldric-aios-controller eldric-aios-edge

# Inference hosts (GPU-equipped)
sudo dnf install eldric-aios-worker eldric-aios-inferenced
# set-environment changes the running systemd manager only; it does not persist across reboots
sudo systemctl set-environment ELDRIC_CONTROLLER=https://mgmt-host:8880

# Data hosts
sudo dnf install eldric-aios-data

# Optional specialised
sudo dnf install eldric-aios-{agent,media,comm,science,training,xlstmd,iiotd}

Each daemon registers itself with the controller on first start. Running systemctl status 'eldric-*' on each host (quoted, so the shell does not expand the pattern) confirms that the units are active. The cluster topology page in the chat shell (Admin Console → Cluster) shows the registered workers in real time.
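
For a scripted version of the per-host check, something like the following could be used. It is a sketch only: check_units is a hypothetical helper, and the unit names in the example are assumed to match the package names, which this guide confirms only for eldric-aios and eldric-aios-controller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report any of the given systemd units that are not active.
# Returns 0 when every unit is active, 1 otherwise.
check_units() {
  local unit state rc=0
  for unit in "$@"; do
    # `systemctl is-active` prints the state (active, inactive, failed, ...)
    # and exits non-zero unless the unit is active.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    if [[ $state != active ]]; then
      echo "NOT ACTIVE: $unit (${state:-unknown})"
      rc=1
    fi
  done
  return "$rc"
}

# Example for an inference host (not run here; unit names are an assumption):
# check_units eldric-aios-worker eldric-aios-inferenced
```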

Day 2 — operations

The recurring admin surface.

Users & tenants

Admin Console → Users to add, suspend or remove users. Roles are Viewer / Developer / Admin / SuperAdmin (the last for cross-tenant operations only). Admin Console → Tenants to add new tenants, one per department / study / project / customer. Per-tenant scope is enforced at the gateway; cross-tenant access is denied unconditionally.

Walkthrough — onboarding a new department: (1) create the tenant (Tenants → New) with a short slug; (2) assign a per-tenant storage quota; (3) add the department head as Admin of that tenant; (4) the Admin invites their users via the Admin Console of their own tenant. The platform-level SuperAdmin steps out at this point — day-to-day administration lives inside the tenant.

Knowledge bases

Admin Console → KBs to provision per-tenant knowledge bases. Each KB has its own embedding model + vector storage + (optional) matrix-memory layer. The compressed-memory preview lives here — see advanced retrieval for the opt-in path.

Walkthrough — adding documents: (1) KBs → New KB → pick embedding model and the optional matrix-memory tier; (2) KB → Upload → drop PDF / DOCX / Markdown / HTML / plain text (or pull from a Data Worker mount); (3) the embedding pipeline runs in the background — track in KBs → Status; (4) chat against the KB by selecting it in the chat shell's source picker, or query directly via the API. Re-embedding after a model change rebuilds the entire KB in place; no manual rollover needed.

Model registry

Admin Console → Models to manage which models are visible per tenant — show all, restrict to a curated list, hide external APIs entirely. The backend badges (Ollama, OpenAI, Inferenced, vLLM, llama.cpp, and so on — see model providers) are auto-derived from each model's source. Pinning a model as the per-tenant default makes it the entry point for new conversations.

Licensing

Admin Console → License to drop in your signed license file. The controller verifies the Ed25519 signature on the file and lifts limits accordingly. Mid-license updates are hot, with no restart needed. License email: license@core.at.

Logs & audit

Use journalctl -u eldric-aios for the unified meta-unit, or the per-daemon units (journalctl -u eldric-aios-controller and so on). The audit ledger at /var/lib/eldric/audit/ hash-chains every admin-side action and AI-assisted decision, giving a defensible record for compliance reviews. The ledger is append-only and tamper-evident: an admin cannot edit prior entries, even via direct file access, without breaking the hash chain. Admin Console → Audit exports a slice of the ledger as signed JSON for compliance handoff.
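
To illustrate why a hash-chained ledger is tamper-evident, here is a minimal sketch of the general technique. It does not reproduce the Eldric ledger format, which is not described in this guide: the one-line "hash<TAB>payload" layout, the all-zero genesis value and the function names are assumptions made for the example, and payloads are assumed to contain no tabs or newlines.

```shell
#!/usr/bin/env bash
# Illustrative hash chain; file layout and names are hypothetical.

GENESIS=$(printf '%064d' 0)   # 64 zeros stand in for "no previous entry"

# Hash of an entry = SHA-256 over the previous entry's hash plus this payload.
entry_hash() {
  printf '%s\n%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# Append a payload to a ledger file, linking it to the last entry.
ledger_append() {
  local file=$1 payload=$2 prev=$GENESIS
  if [[ -s $file ]]; then
    prev=$(tail -n 1 "$file" | cut -f1)
  fi
  printf '%s\t%s\n' "$(entry_hash "$prev" "$payload")" "$payload" >> "$file"
}

# Recompute the chain from the top; fail at the first entry that does not match.
ledger_verify() {
  local file=$1 prev=$GENESIS hash payload line=0
  while IFS=$'\t' read -r hash payload; do
    line=$(( line + 1 ))
    if [[ $hash != "$(entry_hash "$prev" "$payload")" ]]; then
      echo "chain broken at entry $line"
      return 1
    fi
    prev=$hash
  done < "$file"
}
```

Because each hash covers the previous one, changing any earlier payload invalidates that entry and every entry after it, which is what makes a silent edit detectable.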

Monitoring

What to watch.

Health endpoints. Every daemon serves /health at its primary port. A simple liveness probe from your monitoring stack hits these.
Metrics endpoints. Same daemons serve /metrics in Prometheus format. Standard counters (request rate, error rate, latency percentiles) plus per-tenant / per-model breakdowns.
OpenTelemetry export. Off by default. To opt in, set the OTLP endpoint via the Admin Console → Telemetry — spans, counters and histograms flow to your collector. Low-cardinality path normalisation is built in.
Cluster topology page. Live worker / router / data-node status with current load. The first place to look when something feels slow.
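
A liveness probe against the /health endpoints can be as small as the sketch below. probe_health is a hypothetical helper, not an Eldric tool; the only port taken from this guide is the controller's 8880, and the ports of the other daemons are left to your deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe /health on a list of base URLs.
# Prints one line per endpoint and returns non-zero if any probe fails.
probe_health() {
  local base rc=0
  for base in "$@"; do
    # -f makes curl fail on HTTP errors; --max-time bounds a hung daemon.
    if curl -fsS -o /dev/null --max-time 5 "$base/health"; then
      echo "ok   $base"
    else
      echo "FAIL $base"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not run here):
# probe_health https://mgmt-host:8880
```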

Recommended alerts to wire into your existing stack:

p95 latency on /v1/chat/completions above your service objective — typically 2× the median over a rolling window. Fires when an inference worker, a backend model or a cloud provider has degraded.
Error rate on the same path above 1% over five minutes — covers backend outages, license expiry, capacity saturation.
Data Worker disk usage above 85% on any tenant volume — backup destination, vector store and matrix memory grow on the data worker.
Controller heartbeat misses on any worker over three intervals — the worker is gone or unreachable.
License-expiry approaching at 30 / 14 / 7 days — the controller emits a metric you can alert on; license renewals are not surprises.

The Admin Console → Telemetry page suggests sensible defaults for each. Tune to your traffic shape; alerts that fire without anyone needing to act are noise to your on-call.
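
The thresholds in the list above reduce to simple comparisons once the numbers are in hand. The sketch below shows three of them under stated assumptions: the function names are hypothetical, the error and request counts are whatever you have already read from /metrics (the metric names are not given in this guide), and disk usage is read with POSIX df rather than from the platform.

```shell
#!/usr/bin/env bash
# Hypothetical threshold checks; each returns 0 when the alert should fire.

# error_rate_exceeds <errors> <total_requests> <threshold_percent>
error_rate_exceeds() {
  # Compare errors*100 with threshold*total to avoid a floating-point division.
  awk -v e="$1" -v t="$2" -v p="$3" 'BEGIN { exit !(t > 0 && e * 100 > p * t) }'
}

# disk_over <mount_point> <threshold_percent>
disk_over() {
  local used
  # Column 5 of `df -P` is the capacity, e.g. "42%".
  used=$(df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  [[ $used =~ ^[0-9]+$ ]] && (( used > $2 ))
}

# license_tier <days_left>: prints the tightest of 7 / 14 / 30 that has been
# reached, or returns 1 when expiry is more than 30 days away.
license_tier() {
  local days=$1 tier
  for tier in 7 14 30; do
    if (( days <= tier )); then
      echo "$tier"
      return 0
    fi
  done
  return 1
}
```

For example, error_rate_exceeds 12 1000 1 fires because 1.2% is above the 1% threshold, while disk_over /some/volume 85 fires only once usage passes 85%.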

Backup

What to back up.

Two backup paths cover the cluster state:

The platform's own snapshot system. Admin Console → Backups creates a local snapshot covering controller state, vector storage, matrix memory and tenant configuration. Snapshots are SHA-256-verified and a manifest tracks dependencies between snapshots. Restore is per-snapshot.
Portable bundles (.nexus). Admin Console → Bundle export packages a tenant (or a project, or the full cluster) into a single signed file you can move between installations. See your data for the customer-facing view of this.

For offsite / disaster-recovery copies, mount your offsite storage on the data worker and point the snapshot system at it. This is the 5.0 path; 5.1 adds the offsite-destination automation.
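
If you also copy snapshots to the offsite mount by hand, an independent checksum pass is a cheap safeguard. The following is a sketch of that idea, not the platform's own verification: the helper names, the SHA256SUMS file name and the assumption that a snapshot is a plain directory of files are all hypothetical, and the commands assume GNU coreutils and findutils.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the on-disk snapshot layout is an assumption.

# Write a SHA-256 manifest (standard sha256sum format) for every file in a directory.
write_manifest() {
  ( cd "$1" &&
    find . -type f ! -name SHA256SUMS -print0 | sort -z |
      xargs -0 -r sha256sum > SHA256SUMS )
}

# Copy a snapshot directory into an existing destination directory (for example
# the offsite mount) and verify the copy against the manifest.
copy_and_verify() {
  local src=$1 dest=$2
  write_manifest "$src" &&
    cp -a "$src" "$dest" &&
    ( cd "$dest/$(basename "$src")" && sha256sum --quiet -c SHA256SUMS )
}

# Example (not run here; both paths are placeholders):
# copy_and_verify /path/to/snapshot-dir /mnt/offsite
```

Running sha256sum --quiet -c SHA256SUMS inside the copied directory at a later date repeats the check without touching the source.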

Upgrades

From one alpha (or one minor release) to the next.

The controller runs a rolling-update orchestrator that walks every node in turn: drain → install → restart → verify, then move on. From the Admin Console → Updates, pick the target version and start; the orchestrator handles the sequence and reports per-node status.
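
The drain → install → restart → verify sequence can be pictured as a nested loop. This sketch is illustrative only: the real orchestrator runs inside the controller, the function names are hypothetical, and halting at the first failed step is an assumption made here rather than documented behaviour.

```shell
#!/usr/bin/env bash
# Illustrative model of a rolling update; not the controller's implementation.

# rolling_update <step_fn> <node>...
# Calls "<step_fn> <step> <node>" for each step on each node, one node at a
# time, and stops at the first failure so later nodes are left untouched.
rolling_update() {
  local step_fn=$1 node step
  shift
  for node in "$@"; do
    for step in drain install restart verify; do
      if ! "$step_fn" "$step" "$node"; then
        echo "$node: $step failed, halting"
        return 1
      fi
    done
    echo "$node: ok"
  done
}

# Example with a placeholder step function that only logs what it would do:
# dry_run() { echo "would $1 $2"; }
# rolling_update dry_run gpu-01 gpu-02 data-01
```

Finishing all four steps on one node before starting the next is what keeps the rest of the cluster serving traffic during the update.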

For single-node installs the standard sudo dnf update eldric-aios works directly. For air-gapped clusters, mirror repo.eldric.ai/5.0/ locally and point dnf at the mirror.
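
Pointing dnf at a local mirror usually means adding a repository definition. The sketch below prints one; the repository id eldric-mirror, the helper name and the example mirror URL are assumptions, and signature-checking options (gpgcheck, gpgkey) are left out because the signing key location is not given in this guide. Any repository entry created by install.sh may need to be disabled separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a dnf .repo definition for a local mirror.
# Usage: mirror_repo <baseurl>
mirror_repo() {
  cat <<EOF
[eldric-mirror]
name=Eldric 5.0 (local mirror)
baseurl=$1
enabled=1
EOF
}

# Example (not run here; the URL is a placeholder for your own mirror):
# mirror_repo http://mirror.internal/eldric/5.0/ | sudo tee /etc/yum.repos.d/eldric-mirror.repo
```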

Rollback automation arrives with 5.1 (§70). On 5.0, rollback is manual: pin the previous version and re-run the orchestrator.

Running an Eldric cluster.
A walkthrough.

What a cluster looks like.

Standing the cluster up.

Single-node evaluation

Multi-node production

The recurring admin surface.

Users & tenants

Knowledge bases

Model registry

Licensing

Logs & audit

What to watch.

What to back up.

From one alpha (or one minor release) to the next.

Confirming readiness before a soak.

Troubleshooting paths.

Next.
