Clustering & high availability

Resilient AI infrastructure for serious deployments.

Eldric clusters survive node failures, network partitions, and entire datacenter losses — automatically. From a single developer laptop to a 50+ node federation spanning continents, the same software, the same configuration grammar, the same install package.

§1 — How Eldric clusters

One package. Five postures.

Available today

The same eldric-aios RPM runs on a Raspberry Pi sitting under a developer's monitor and on a 50-node bare-metal cluster behind a hospital network. The role each node plays is configuration, not code. Start with one. Grow to three the day uptime starts mattering. Federate across sites the day you open a second office.

No re-platforming. Every cluster posture below is the same binary with a different config.yaml — the upgrade path from "single laptop trial" to "multi-region enterprise" is a question of nodes you add, not software you migrate to.

Fig. 01 Eldric scales from a single developer laptop to a 50+ node enterprise cluster. The same software ships in all five postures — node count is a configuration choice, not a product tier. Leader-eligible nodes are marked in terracotta; workers in navy.

§2 — Automatic failover

When a node falls, the cluster catches it.

Production-grade · two gates

Production-grade multi-controller Raft consensus. Every node in an Eldric cluster exchanges heartbeats with its peers. When a leader disappears (the host dies, the network drops, an admin pulls the wrong cable), the remaining nodes notice within seconds and elect a new leader. Under live test, the cluster forms, replicates, fails over, and recovers from crashes. Inference traffic re-routes automatically, and open chat sessions continue from the next request.

Honest scope. The consensus story is real and validated end-to-end under production-like conditions. The full production-HA bootstrap story still has two pieces in flight for an upcoming 5.0.x patch: cross-controller identity replication (so a recovered controller picks up the cluster's identity state) and a leader-aware client endpoint (so clients always reach the current leader without manual reconfiguration). Both are designed and dispatched; both close cleanly.

Fig. 02 Failover sequence: a healthy three-node cluster detects a leader loss in roughly five seconds and elects a replacement before most in-flight requests would even time out. The Raft-style election guarantees at most one leader per term, even during network partitions.
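
As a rough illustration of the detection step, the sketch below flags peers whose heartbeats have been silent for longer than a timeout. It is a minimal Python sketch, not Eldric's implementation: the HeartbeatMonitor class, the node names, and the 5-second timeout (borrowed from the "roughly five seconds" figure above) are assumptions made for illustration. The election that would follow detection is left out.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer (hypothetical helper)."""

    timeout_s: float = 5.0  # assumed detection window, not a documented setting
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived at time `now` (seconds)."""
        self.last_seen[peer] = now

    def suspected(self, now: float) -> list[str]:
        """Return peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-b", now=4.0)  # node-b keeps sending; node-a goes silent
print(monitor.suspected(now=6.0))  # node-a has been silent for 6 s, over the 5 s limit
```

Times are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test.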

A cluster that needs a human in the loop to recover is not a cluster — it's a single point of failure with extra hops.

§3 — Multi-site federation

One brain. Many bodies.

On the roadmap — later in 5.0.x

An enterprise rarely lives in one building. Federation lets each site run its own autonomous Eldric cluster — keeping the bandwidth-heavy traffic local — while a federation layer keeps directories, knowledge bases, and access policy in sync. A branch office stays useful when the WAN link drops. The headquarters keeps a warm replica of every branch for regional disaster recovery.

Why federation, not one big cluster? Sub-millisecond consensus does not survive a transatlantic round trip. Federation accepts that physics and works with it — local clusters for hot paths, federated sync for eventual consistency.
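
The physics can be put into numbers. In a leader-based protocol such as Raft, a write commits only after the leader has heard back from a majority, so each commit costs at least one round trip to the nearest quorum. The Python arithmetic below estimates a best-case transatlantic round trip. Both constants are approximations assumed for illustration (a New York to London great-circle distance and the speed of light in fibre), and real links are slower because cables do not follow great circles and routers add delay.

```python
# Rough lower bound on a transatlantic round trip, ignoring routing and queuing.
DISTANCE_KM = 5_570          # assumed: approximate New York to London distance
FIBRE_SPEED_KM_S = 200_000   # assumed: light in optical fibre, about two thirds of c

one_way_ms = DISTANCE_KM / FIBRE_SPEED_KM_S * 1_000
round_trip_ms = 2 * one_way_ms

print(f"best-case round trip: {round_trip_ms:.1f} ms")
```

Even this floor is tens of milliseconds per commit, which is why the hot path stays inside each site.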

Fig. 03 A federated Eldric deployment: a primary Vienna headquarters, three regional branches, and a warm disaster-recovery replica. Each site holds its own autonomous cluster; the federation layer keeps directories, RAG corpora, and policy in sync.

Branch autonomy

Each site stays useful when the WAN drops.

A federated cluster does not turn into a thin client. The branch keeps its local models, its local RAG corpus, and its local users — and reconciles state with the rest of the federation when the link returns.

Regional sovereignty

Data stays where the law says it must.

Per-tenant policy lets you mark knowledge bases as EU-only, US-only, or single-site. The federation layer refuses to replicate anything tagged outside its allowed regions — a rule the cluster enforces, not the operator.
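
One way to picture the rule is as a check that runs before anything leaves a site. The Python sketch below is an assumption about the shape of such a check, not the shipped policy engine (the feature is still on the roadmap): the KnowledgeBase type, the allowed_regions tag, and the example names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeBase:
    name: str
    allowed_regions: frozenset[str]  # hypothetical residency tag, e.g. {"EU"}


def may_replicate(kb: KnowledgeBase, target_region: str) -> bool:
    """Allow replication only into a region the knowledge base is tagged for."""
    return target_region in kb.allowed_regions


patients = KnowledgeBase("patient-records", frozenset({"EU"}))
handbook = KnowledgeBase("staff-handbook", frozenset({"EU", "US"}))

for kb in (patients, handbook):
    for region in ("EU", "US"):
        verdict = "replicate" if may_replicate(kb, region) else "refuse"
        print(f"{kb.name} -> {region}: {verdict}")
```

The default is refusal: a knowledge base with no matching tag is never replicated, so a missing label fails closed rather than open.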

Disaster recovery

Lose a building, keep the work.

The warm DR replica receives a continuous stream of federation updates. If the primary datacenter is offline — fire, flood, fibre cut — the replica is already current to within minutes. Cutover is an operator decision, not a recovery project.

§4 — Smart discovery

Nodes find each other without a spreadsheet.

Layers 1+2 today

The hardest part of running a distributed system is usually not the consensus; it is keeping track of which machine is at which address. Eldric stacks four discovery layers so that an operator does not have to maintain a list by hand. On a quiet office LAN, nodes find each other in seconds. On a sprawling WAN, a handful of admin hints starts the chain.

Today. mDNS plus the in-cluster gossip protocol cover any single-site deployment. The remaining two layers — DNS-SD and admin WAN hints — arrive with multi-site federation later in 5.0.x.

Fig. 04 Four layers of discovery, attempted in order. On a single LAN, layer one is usually enough; layer two carries everything once the cluster is warm. Layers three and four arrive with multi-site federation.
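
The "attempted in order" idea can be sketched as a simple fallback chain. The Python below is illustrative only: the layer names follow the text, but the probe functions are stand-ins that return canned results rather than touching the network, the addresses are made up, and stopping at the first layer that finds a peer is an assumption about how the layers combine.

```python
from typing import Callable

# Each layer is a function returning the peer addresses it could find.
Layer = Callable[[], list[str]]


def discover(layers: list[tuple[str, Layer]]) -> tuple[str, list[str]]:
    """Try each layer in order and return the first one that finds any peers."""
    for name, probe in layers:
        peers = probe()
        if peers:
            return name, peers
    return "none", []


layers: list[tuple[str, Layer]] = [
    ("mdns", lambda: []),                    # nothing answered on the local link
    ("gossip", lambda: ["10.0.0.12:7000"]),  # a known peer shared its member list
    ("dns-sd", lambda: ["10.1.0.5:7000"]),   # never reached: an earlier layer succeeded
    ("wan-hints", lambda: []),
]

print(discover(layers))
```

Keeping each layer behind the same function signature means the later layers can be added to the list without changing the lookup loop.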

§5 — Deployment scales

What fits each posture.

An at-a-glance map from node count to what it gives you. Every row is the same Eldric package — the differences are in how many nodes you give it, and what role you assign each.

Nodes	Topology	Failover	Typical use case
1	single host	none (restart only)	Developer laptop, single-team trial, Raspberry Pi proof of concept. Same package — just one of it.
2–3	quorum cluster	automatic	Small office, always-on deployment. Three nodes is the sweet spot — survives any single node loss with a clean majority vote.
4–10	leader + workers	automatic	Department or mid-size company. A handful of leader-eligible nodes; the rest are dedicated workers running models, RAG indexes, or specialist agents.
10–50	multi-role mesh	automatic + zonal	Enterprise on a single site. Workers split across racks or availability zones so a rack-level loss is survivable. Leader stays in a different zone.
50+	federated mesh	automatic + cross-site	Multi-site, multi-region. Per-site clusters federate over the WAN; warm DR replicas in another region; per-tenant data-residency rules enforced at the federation layer.
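
The failover column follows from majority-vote arithmetic. The Python sketch below computes the quorum size and the number of losses a group of voters can absorb. It assumes every counted node is a voting member; in the larger postures, only the leader-eligible nodes would presumably vote.

```python
def quorum(voters: int) -> int:
    """Smallest majority of `voters`."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum(voters)


for n in (1, 2, 3, 5):
    print(f"{n} voters: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under this arithmetic a two-voter group needs both members for a majority, which is why three nodes is the first size that survives a single node loss.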

§6 — When you need this

Five reasons to cluster Eldric.

A single-node install is genuinely production-grade for a lot of teams. Clustering pays for itself in five specific situations.

Uptime

Uptime-critical AI. Internal copilots that the whole company depends on. Customer-facing chat baked into a product. Workflows where "AI is down" means humans cannot work. Three nodes give you any-single-node tolerance; ten give you any-rack tolerance.

Residency

Regulatory data residency. EU regulations that forbid medical or financial records crossing a border. National-security workloads that cannot touch a public cloud. Federation lets you keep every byte inside the jurisdiction it belongs to, with the policy enforced by the cluster rather than by procedure.

Geography

Geographic spread. Branches in different time zones, a sales floor in a different country, a manufacturing site at the edge of the network. Local clusters keep latency low; federation keeps everyone working from the same playbook.

Recovery

Disaster recovery. If the primary datacenter burns down, when does work resume? With a warm DR replica, the answer is "the next minute, after an operator confirms the cutover." Without one, the answer involves restoring from backups.

Sovereignty

Sovereign deployment. Air-gapped networks. Government clouds. On-premises by mandate. Eldric clusters do not depend on any external service to elect a leader, find a peer, or validate a license — the whole loop closes inside your perimeter.

§7 — Honest roadmap

Today, in flight, on the way.

A three-column ledger of where each piece of the resilience story actually is. Anything not in the leftmost column is not yet running in production — we will not pretend otherwise.

Today ●

5.0 — shipping

Single-node install on any hardware
Clustered workers (inference, RAG, agents)
Routers and edge gateways
Multi-controller Raft consensus + automatic leader election
Automatic failover with sub-10-second detection
Cluster crash-recovery under live test
Gossip mesh for in-cluster discovery
mDNS / Bonjour for local-network discovery
Rolling upgrades across the cluster
Per-tenant data isolation
Hash-chained audit ledger
Backup, restore, and verify of cluster state

In flight ●

Next 5.0.x patches — designed & dispatched

Cross-controller identity replication (full production-HA bootstrap)
Leader-aware client endpoint (clients always reach the current leader)
Zonal awareness (rack / AZ placement hints)
Continuous replication of vector indexes
Continuous replication of matrix memory
Encrypted gossip with mutual TLS between nodes
Native xLSTM inference backend (preview)

On the way ●

Later in 5.0.x — on the roadmap

Multi-site federation across the WAN
Warm disaster-recovery replicas
Per-tenant regional data-residency policy
DNS-SD discovery for cross-subnet clusters
Admin GUI for WAN peer hints
Geographic load shaping (latency-aware routing)
Cross-site session continuity for chat

Roadmap timelines change. Anything in the right two columns is genuinely planned and budgeted — but we will not promise a calendar quarter for software we have not finished. The thing we will promise is that the table will tell you the truth.
