Eldric clusters survive node failures, network partitions, and entire datacenter losses automatically. From a single developer laptop to a 50+ node federation spanning continents, it is the same software, the same configuration grammar, and the same install package.
The same eldric-aios RPM runs on a Raspberry Pi sitting under a developer's monitor and on a 50-node bare-metal cluster behind a hospital network. The role each node plays is configuration, not code. Start with one. Grow to three the day uptime starts mattering. Federate across sites the day you open a second office.
The differences live in config.yaml: the upgrade path from "single laptop trial" to "multi-region enterprise" is a question of nodes you add, not software you migrate to.
Eldric runs production-grade multi-controller Raft consensus. Every node in an Eldric cluster keeps a heartbeat to its peers. When a leader disappears (the host dies, the network drops, an admin pulls the wrong cable), the remaining nodes notice within seconds and elect a new leader. Under live test, the cluster forms, replicates, fails over, and recovers from crashes. Inference traffic re-routes automatically. Open chat sessions continue from the next request.
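A minimal sketch of the two checks this paragraph describes: noticing a silent leader and counting votes for a new one. It is illustrative Python, not Eldric's implementation. The `Peer` type, the function names, and the 3-second timeout are assumptions, and real Raft adds terms, log comparison, and randomized election timeouts that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    last_heartbeat: float  # seconds, read from a monotonic clock


def leader_suspected(leader: Peer, now: float, timeout: float) -> bool:
    """True when the leader has been silent for longer than the timeout."""
    return now - leader.last_heartbeat > timeout


def wins_election(votes_granted: int, cluster_size: int) -> bool:
    """A candidate needs a strict majority of the whole cluster."""
    return votes_granted >= cluster_size // 2 + 1


leader = Peer("node-a", last_heartbeat=100.0)
print(leader_suspected(leader, now=101.0, timeout=3.0))  # False: heard 1 s ago
print(leader_suspected(leader, now=104.5, timeout=3.0))  # True: silent for 4.5 s
print(wins_election(votes_granted=2, cluster_size=3))    # True: 2 of 3 is a majority
```

The majority rule is what keeps two halves of a partitioned cluster from both electing a leader: at most one side can hold more than half of the votes.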
A cluster that needs a human in the loop to recover is not a cluster — it's a single point of failure with extra hops.
An enterprise rarely lives in one building. Federation lets each site run its own autonomous Eldric cluster — keeping the bandwidth-heavy traffic local — while a federation layer keeps directories, knowledge bases, and access policy in sync. A branch office stays useful when the WAN link drops. The headquarters keeps a warm replica of every branch for regional disaster recovery.
A federated cluster does not turn into a thin client. The branch keeps its local models, its local RAG corpus, and its local users — and reconciles state with the rest of the federation when the link returns.
Per-tenant policy lets you mark knowledge bases as EU-only, US-only, or single-site. The federation layer refuses to replicate anything tagged outside its allowed regions — a rule the cluster enforces, not the operator.
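The residency rule can be pictured as a simple membership test. The sketch below is hypothetical Python, not the Eldric policy engine: the region labels, the site names, and both function names are assumptions made for illustration, and single-site tagging is not modelled.

```python
from __future__ import annotations


def may_replicate(allowed_regions: frozenset[str], target_region: str) -> bool:
    """Replicate only when the target site's region is in the allowed set."""
    return target_region in allowed_regions


def replication_targets(allowed_regions: frozenset[str], sites: dict[str, str]) -> list[str]:
    """Filter a {site: region} map down to the sites that may receive a copy."""
    return sorted(
        site for site, region in sites.items()
        if may_replicate(allowed_regions, region)
    )


# Hypothetical federation with two EU sites and one US site.
sites = {"berlin": "EU", "paris": "EU", "austin": "US"}
eu_only = frozenset({"EU"})

print(replication_targets(eu_only, sites))  # ['berlin', 'paris']
print(may_replicate(eu_only, "US"))         # False
```

Placing the check in the replication path, rather than in an operator runbook, is what makes the rule something the cluster enforces.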
The warm DR replica receives a continuous stream of federation updates. If the primary datacenter is offline — fire, flood, fibre cut — the replica is already current to within minutes. Cutover is an operator decision, not a recovery project.
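Before a cutover, an operator would likely want to confirm how far behind the replica is. The following is a sketch under stated assumptions: the 5-minute threshold is a placeholder for "within minutes", and the function names are illustrative rather than part of any Eldric API.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: the text only says "within minutes".
MAX_LAG = timedelta(minutes=5)


def replica_lag(last_applied: datetime, now: datetime) -> timedelta:
    """Time elapsed since the replica last applied a federation update."""
    return now - last_applied


def current_enough(last_applied: datetime, now: datetime,
                   max_lag: timedelta = MAX_LAG) -> bool:
    """True when the replica is within the allowed lag window."""
    return replica_lag(last_applied, now) <= max_lag


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(current_enough(now - timedelta(minutes=2), now))   # True
print(current_enough(now - timedelta(minutes=30), now))  # False
```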
The hardest part of running a distributed system is usually not the consensus; it is keeping track of which machine is at which address. Eldric stacks four discovery layers so an operator does not have to maintain a list. On a quiet office LAN, nodes find each other in seconds. On a sprawling WAN, a handful of admin hints kicks off the chain.
An at-a-glance map from node count to what it gives you. Every row is the same Eldric package — the differences are in how many nodes you give it, and what role you assign each.
| Nodes | Topology | Failover | Typical use case |
|---|---|---|---|
| 1 | single host | none (restart only) | Developer laptop, single-team trial, Raspberry Pi proof of concept. Same package — just one of it. |
| 2–3 | quorum cluster | automatic | Small office, always-on deployment. Three nodes is the sweet spot — survives any single node loss with a clean majority vote. |
| 4–10 | leader + workers | automatic | Department or mid-size company. A handful of leader-eligible nodes; the rest are dedicated workers running models, RAG indexes, or specialist agents. |
| 10–50 | multi-role mesh | automatic + zonal | Enterprise on a single site. Workers split across racks or availability zones so a rack-level loss is survivable. Leader stays in a different zone. |
| 50+ | federated mesh | automatic + cross-site | Multi-site, multi-region. Per-site clusters federate over the WAN; warm DR replicas in another region; per-tenant data-residency rules enforced at the federation layer. |
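
The "clean majority vote" in the table comes down to simple arithmetic. The sketch below shows standard majority-quorum math for Raft-style consensus; it is not Eldric code, and the function names are illustrative. The source does not say whether Eldric treats two-node clusters specially, so the figures describe plain majority voting only.

```python
def quorum(nodes: int) -> int:
    """Smallest strict majority of `nodes` voters."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return nodes - quorum(nodes)


for n in (1, 2, 3, 5):
    print(f"{n} voters: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 voters: quorum 1, tolerates 0 failure(s)
# 2 voters: quorum 2, tolerates 0 failure(s)
# 3 voters: quorum 2, tolerates 1 failure(s)
# 5 voters: quorum 3, tolerates 2 failure(s)
```

Under plain majority voting, three voters is the smallest group that can lose one member and still elect a leader, which is why it reads as the sweet spot.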
A single-node install is genuinely production-grade for a lot of teams. Clustering pays for itself in five specific situations.
A three-column ledger of where each piece of the resilience story actually stands. Anything not in the leftmost column is not yet running in production, and we will not pretend otherwise.
5.0 — shipping
Next 5.0.x patches — designed & dispatched
Later in 5.0.x — on the roadmap
Roadmap timelines change. Anything in the right two columns is genuinely planned and budgeted — but we will not promise a calendar quarter for software we have not finished. The thing we will promise is that the table will tell you the truth.
A three-node cluster runs comfortably on three commodity Linux boxes. A federated multi-site deployment runs on whatever the IT department already owns. The download is the same; the configuration changes.