Clustering & high availability

Resilient AI infrastructure
for serious deployments.

Eldric clusters survive node failures, network partitions, and entire datacenter losses — automatically. From a single developer laptop to a 50+ node federation spanning continents, the same software, the same configuration grammar, the same install package.

§1 — How Eldric clusters

One package. Five postures.

Available today

The same eldric-aios RPM runs on a Raspberry Pi sitting under a developer's monitor and on a 50-node bare-metal cluster behind a hospital network. The role each node plays is configuration, not code. Start with one. Grow to three the day uptime starts mattering. Federate across sites the day you open a second office.

No re-platforming. Every cluster posture below is the same binary with a different config.yaml — the upgrade path from "single laptop trial" to "multi-region enterprise" is a question of nodes you add, not software you migrate to.
[Figure: Eldric topology evolution from one to fifty nodes. Single: 1 node (dev laptop · Pi 4 · branch lab). Quorum: 3 nodes (office cluster · always-on). Scaled: 10 nodes (department · multi-model). Enterprise: 50+ nodes (multi-site · multi-region). Grows in place, no re-platforming.]
Fig. 01 Eldric scales from a single developer laptop to a 50+ node enterprise cluster. The same software ships in all five postures — node count is a configuration choice, not a product tier. Leader-eligible nodes are marked in terracotta; workers in navy.
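As a rough illustration of "configuration, not code", the sketch below models a node whose role comes entirely from a settings object. The field names (`role`, `leader_eligible`, `peers`) and the role values are hypothetical; they are not taken from the actual config.yaml schema, which this page does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical per-node settings; NOT the real config.yaml schema."""
    name: str
    role: str = "single"           # assumed values: "single", "controller", "worker"
    leader_eligible: bool = False  # assumed flag: may this node become leader?
    peers: tuple = ()              # assumed: addresses of other nodes to contact


def describe_posture(configs):
    """Classify a list of node configs into a coarse topology label."""
    if len(configs) == 1:
        return "single host"
    eligible = sum(1 for c in configs if c.leader_eligible)
    if eligible == len(configs):
        return "quorum cluster"
    return "leader + workers"
```

Under these assumptions, moving from one posture to the next means adding `NodeConfig` entries, not changing `describe_posture` or any other code.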

§2 — Automatic failover

When a node falls, the cluster catches it.

Production-grade · two gates

Eldric runs production-grade multi-controller Raft consensus. Every node in an Eldric cluster keeps a heartbeat to its peers. When a leader disappears (the host dies, the network drops, an admin pulls the wrong cable), the remaining nodes notice within seconds and elect a new leader. The cluster forms, replicates, fails over, and crash-recovers under live test. Inference traffic re-routes automatically. Open chat sessions continue from the next request.

Honest scope. The consensus story is real and validated end-to-end under production-like conditions. The full production-HA bootstrap story still has two pieces in flight for an upcoming 5.0.x patch: cross-controller identity replication (so a recovered controller picks up the cluster's identity state) and a leader-aware client endpoint (so clients always reach the current leader without manual reconfiguration). Both are designed and dispatched; both close cleanly.
[Figure: Failover sequence. t = 0: healthy, heartbeats every 1s. t ≈ 1s: leader fails, heartbeat missed. t ≈ 5s: detection, votes exchanged, election in progress. t ≈ 6s: new leader, traffic resumed.]
Fig. 02 Failover sequence: a healthy three-node quorum detects a leader loss in roughly five seconds and elects a replacement before most in-flight requests would even time out. Because a candidate needs votes from a majority of nodes, the Raft-style election allows at most one leader per election term, even during network partitions.
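The detection and election steps can be sketched in a few lines. This is a minimal illustration of heartbeat timeouts and majority voting in general, not Eldric's implementation; the one-second interval comes from the figure, while the five-second timeout is an assumption chosen to match the "t ≈ 5s" detection point.

```python
HEARTBEAT_INTERVAL_S = 1.0  # from the figure: heartbeats every 1s
DETECTION_TIMEOUT_S = 5.0   # assumption: matches the "t ≈ 5s" detection point


def leader_suspected(last_heartbeat_at, now, timeout=DETECTION_TIMEOUT_S):
    """True once no heartbeat has arrived within the timeout window."""
    return (now - last_heartbeat_at) >= timeout


def wins_election(votes_received, cluster_size):
    """A candidate becomes leader only with a strict majority of ALL nodes.

    cluster_size counts the failed leader too, so a minority partition
    can never gather enough votes to elect its own leader.
    """
    return votes_received >= cluster_size // 2 + 1
```

In a three-node cluster the two survivors hold a majority (2 of 3), so one of them can win. A single isolated node cannot, which is what prevents two leaders in the same term.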

A cluster that needs a human in the loop to recover is not a cluster — it's a single point of failure with extra hops.


§3 — Multi-site federation

One brain. Many bodies.

On the roadmap — later in 5.0.x

An enterprise rarely lives in one building. Federation lets each site run its own autonomous Eldric cluster — keeping the bandwidth-heavy traffic local — while a federation layer keeps directories, knowledge bases, and access policy in sync. A branch office stays useful when the WAN link drops. The headquarters keeps a warm replica of every branch for regional disaster recovery.

Why federation, not one big cluster? Sub-millisecond consensus does not survive a transatlantic round trip. Federation accepts that physics and works with it — local clusters for hot paths, federated sync for eventual consistency.
[Figure: Multi-site federation. Headquarters: Vienna, 10-node cluster, primary. Branch: London, 3-node, autonomous. Branch: Singapore, 3-node, autonomous. Branch: Frankfurt, DR replica, warm. Sites are joined by federation links carrying federation sync.]
Fig. 03 A federated Eldric deployment: a primary Vienna headquarters, three regional branches, and a warm disaster-recovery replica. Each site holds its own autonomous cluster; the federation layer keeps directories, RAG corpora, and policy in sync.
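A back-of-the-envelope sketch of the latency argument for federation: in a leader-based majority protocol, a write cannot commit until enough followers acknowledge it, so commit latency is bounded below by the round trip to the slowest follower the leader still has to wait for. The round-trip figures in the usage note are illustrative assumptions, not measurements of any Eldric deployment.

```python
def commit_latency_ms(follower_rtts_ms):
    """Lower bound on commit latency for a leader-based majority protocol.

    The leader counts as one vote, so it needs acknowledgements from
    (majority - 1) followers and waits for the slowest of the fastest ones.
    """
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    acks_needed = cluster_size // 2           # equals majority - 1
    if acks_needed == 0:
        return 0.0
    return sorted(follower_rtts_ms)[acks_needed - 1]
```

With two followers on the same LAN at an assumed 0.5 ms and 0.6 ms, `commit_latency_ms([0.5, 0.6])` gives 0.5. Stretch the same three nodes across an ocean at an assumed 80 ms and 85 ms and every write waits at least 80 ms. Keeping consensus inside each site and syncing between sites asynchronously avoids that cost.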

Branch autonomy

Each site stays useful when the WAN drops.

A federated cluster does not turn into a thin client. The branch keeps its local models, its local RAG corpus, and its local users — and reconciles state with the rest of the federation when the link returns.

Regional sovereignty

Data stays where the law says it must.

Per-tenant policy lets you mark knowledge bases as EU-only, US-only, or single-site. The federation layer refuses to replicate anything tagged outside its allowed regions — a rule the cluster enforces, not the operator.
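One way to picture the rule is as a default-deny filter applied before any replication. The sketch below is an assumption about the shape of such a check, not the federation layer's actual API; the function names and the region labels assigned to each site are hypothetical.

```python
def may_replicate(allowed_regions, target_region):
    """Allow replication only to regions explicitly listed for the knowledge base."""
    return target_region in allowed_regions


def replication_targets(allowed_regions, sites):
    """Filter a {site_name: region} mapping down to sites allowed to hold a copy."""
    return sorted(
        site for site, region in sites.items()
        if may_replicate(allowed_regions, region)
    )


# Hypothetical region labels for the sites shown in Fig. 03.
SITES = {"Vienna": "EU", "Frankfurt": "EU", "London": "UK", "Singapore": "APAC"}
```

Here `replication_targets({"EU"}, SITES)` returns `["Frankfurt", "Vienna"]`, and a knowledge base with no allowed regions replicates nowhere. The check runs in code on every replication decision, so it does not depend on an operator remembering the rule.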

Disaster recovery

Lose a building, keep the work.

The warm DR replica receives a continuous stream of federation updates. If the primary datacenter is offline — fire, flood, fibre cut — the replica is already current to within minutes. Cutover is an operator decision, not a recovery project.


§4 — Smart discovery

Nodes find each other without a spreadsheet.

Layers 1+2 today

The hardest part of running a distributed system is usually not the consensus; it is keeping track of which machine is at which address. Eldric stacks four discovery layers so an operator does not maintain a list. On a quiet office LAN, nodes find each other in seconds. On a sprawling WAN, a handful of admin hints kicks off the chain.

Today. mDNS plus the in-cluster gossip protocol cover any single-site deployment. The remaining two layers — DNS-SD and admin WAN hints — arrive with multi-site federation later in 5.0.x.
Four-layer discovery stack:

  • Layer 1 · mDNS / Bonjour (available today): same subnet, zero config; the office LAN case.
  • Layer 2 · in-cluster gossip (available today): each new node learns the rest of the cluster from the first peer it meets.
  • Layer 3 · DNS-SD (soon): when the cluster spans subnets, a single DNS record advertises every site to the federation.
  • Layer 4 · admin hints, WAN (soon): for air-gapped or restricted networks, a small set of static peer addresses bootstraps the rest.
Fig. 04 Four layers of discovery, attempted in order. On a single LAN, layer one is usually enough; layer two carries everything once the cluster is warm. Layers three and four arrive with multi-site federation.
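"Attempted in order" can be read as a simple fallback chain: try each layer and stop at the first one that returns peers. The sketch below assumes that first-success behaviour and uses stub probe functions in place of real mDNS, gossip, DNS-SD, or static-hint lookups; none of the names are Eldric APIs.

```python
def discover_peers(layers):
    """Try each (name, probe) pair in order; return the first non-empty result.

    Each probe is a zero-argument callable returning a list of peer addresses.
    Returns (None, []) when every layer comes back empty.
    """
    for name, probe in layers:
        peers = probe()
        if peers:
            return name, peers
    return None, []


# Stub probes standing in for real lookups (hypothetical addresses).
example_layers = [
    ("mdns", lambda: []),                  # nothing on the local subnet
    ("gossip", lambda: []),                # no known peer to ask yet
    ("dns-sd", lambda: []),                # no service record published
    ("admin-hints", lambda: ["10.0.0.5"]), # one static bootstrap address
]
```

With these stubs, `discover_peers(example_layers)` falls through to the last layer and returns `("admin-hints", ["10.0.0.5"])`. On a single LAN the first probe would normally succeed and the later layers would never run.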

§5 — Deployment scales

What fits each posture.

An at-a-glance map from node count to what it gives you. Every row is the same Eldric package — the differences are in how many nodes you give it, and what role you assign each.

| Nodes | Topology | Failover | Typical use case |
|---|---|---|---|
| 1 | single host | none (restart only) | Developer laptop, single-team trial, Raspberry Pi proof of concept. Same package, just one of it. |
| 2–3 | quorum cluster | automatic | Small office, always-on deployment. Three nodes is the sweet spot: it survives any single node loss with a clean majority vote. |
| 4–10 | leader + workers | automatic | Department or mid-size company. A handful of leader-eligible nodes; the rest are dedicated workers running models, RAG indexes, or specialist agents. |
| 10–50 | multi-role mesh | automatic + zonal | Enterprise on a single site. Workers split across racks or availability zones so a rack-level loss is survivable. Leader stays in a different zone. |
| 50+ | federated mesh | automatic + cross-site | Multi-site, multi-region. Per-site clusters federate over the WAN; warm DR replicas in another region; per-tenant data-residency rules enforced at the federation layer. |
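The "sweet spot" claim follows from majority arithmetic. The sketch below assumes a plain majority rule in which every counted node votes; in the larger postures only the leader-eligible nodes would be counted, and workers do not change the result.

```python
def majority(voters):
    """Smallest vote count that is more than half of the voters."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can be lost while a majority can still be formed."""
    return voters - majority(voters)


def tolerance_table(sizes):
    """Return (voters, majority, tolerated_failures) for each cluster size."""
    return [(n, majority(n), tolerated_failures(n)) for n in sizes]
```

Under this rule, one or two voters tolerate no failures, three tolerate one, four still tolerate only one, and five tolerate two. Three is therefore the smallest size that survives a node loss, and an even voter count adds cost without adding tolerance.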

§6 — When you need this

Five reasons to cluster Eldric.

A single-node install is genuinely production-grade for a lot of teams. Clustering pays for itself in five specific situations.

Uptime
Uptime-critical AI. Internal copilots that the whole company depends on. Customer-facing chat baked into a product. Workflows where "AI is down" means humans cannot work. Three nodes give you any-single-node tolerance; ten give you any-rack tolerance.
Residency
Regulatory data residency. EU regulations that forbid medical or financial records crossing a border. National-security workloads that cannot touch a public cloud. Federation lets you keep every byte inside the jurisdiction it belongs to, with the policy enforced by the cluster rather than by procedure.
Geography
Geographic spread. Branches in different time zones, a sales floor in a different country, a manufacturing site at the edge of the network. Local clusters keep latency low; federation keeps everyone working from the same playbook.
Recovery
Disaster recovery. If the primary datacenter burns down, when does work resume? With a warm DR replica, the answer is "the next minute, after an operator confirms the cutover." Without one, the answer involves restoring from backups.
Sovereignty
Sovereign deployment. Air-gapped networks. Government clouds. On-premises by mandate. Eldric clusters do not depend on any external service to elect a leader, find a peer, or validate a license — the whole loop closes inside your perimeter.

§7 — Honest roadmap

Today, in flight, on the way.

A three-column ledger of where each piece of the resilience story actually is. Anything not in the leftmost column is not yet running in production — we will not pretend otherwise.

Today

5.0 — shipping

  • Single-node install on any hardware
  • Clustered workers (inference, RAG, agents)
  • Routers and edge gateways
  • Multi-controller Raft consensus + automatic leader election
  • Automatic failover with sub-10-second detection
  • Cluster crash-recovery under live test
  • Gossip mesh for in-cluster discovery
  • mDNS / Bonjour for local-network discovery
  • Rolling upgrades across the cluster
  • Per-tenant data isolation
  • Hash-chained audit ledger
  • Backup, restore, and verify of cluster state

In flight

Next 5.0.x patches — designed & dispatched

  • Cross-controller identity replication (full production-HA bootstrap)
  • Leader-aware client endpoint (clients always reach the current leader)
  • Zonal awareness (rack / AZ placement hints)
  • Continuous replication of vector indexes
  • Continuous replication of matrix memory
  • Encrypted gossip with mutual TLS between nodes
  • Native xLSTM inference backend (preview)

On the way

Later in 5.0.x — on the roadmap

  • Multi-site federation across the WAN
  • Warm disaster-recovery replicas
  • Per-tenant regional data-residency policy
  • DNS-SD discovery for cross-subnet clusters
  • Admin GUI for WAN peer hints
  • Geographic load shaping (latency-aware routing)
  • Cross-site session continuity for chat

Roadmap timelines change. Anything in the right two columns is genuinely planned and budgeted — but we will not promise a calendar quarter for software we have not finished. The thing we will promise is that the table will tell you the truth.


Get started

Try a cluster on what you already have.

A three-node cluster runs comfortably on three commodity Linux boxes. A federated multi-site deployment runs on whatever the IT department already owns. The download is the same; the configuration changes.