Run models larger than any single node's memory by distributing layers across your Eldric cluster. Every component — Edge, Controller, Router, Worker, Data Worker — is clusterable for full high availability.
A 70B-parameter model needs roughly 40 GB at Q4 quantization, but each of your workers has only 24 GB of VRAM. No single node can hold the model, so its layers must span at least two workers.
From zero-change integrations to native Eldric pipeline parallelism
Use vLLM's built-in tensor parallelism with Ray. Model slices split across GPUs. Eldric worker proxies to the vLLM cluster. Zero Eldric code changes.
Use llama.cpp's native RPC to split GGUF layers across workers. Head node coordinates, RPC servers on each worker hold their layers. Works with existing Ollama models.
Eldric-native distributed inference. Data Worker stores models, Controller assigns layers, Workers pull only their shards. Full cluster integration.
Every GPU holds a slice of every layer. All-to-all communication each token.
Pipeline by layers. Each worker holds a contiguous range. Data flows sequentially.
Every component is clusterable. Multiple Edge servers, Controllers, Routers — and the model is split across Workers via the Data Worker model store.
Each tensor in a GGUF file has a named offset. Workers seek directly to their assigned layers — no full download needed.
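As an illustrative sketch of the seek-to-offset idea (the `blk.<i>.` per-layer naming follows common GGUF convention; this is not Eldric's actual loader), computing the byte span a worker needs for its assigned layers could look like:

```python
def shard_byte_range(tensor_index, layer_range):
    """Given a GGUF tensor index {name: (offset, size_in_bytes)}, return the
    (start, end) byte span covering the assigned layers. GGUF names per-layer
    tensors "blk.<i>.<tensor>", e.g. "blk.12.attn_q.weight"."""
    prefixes = tuple(f"blk.{i}." for i in range(*layer_range))
    spans = [(off, off + size)
             for name, (off, size) in tensor_index.items()
             if name.startswith(prefixes)]
    return min(s for s, _ in spans), max(e for _, e in spans)

# Toy index with three single-tensor layers of 100 bytes each:
index = {"blk.0.attn.weight": (0, 100),
         "blk.1.attn.weight": (100, 100),
         "blk.2.ffn.weight": (200, 100)}
shard_byte_range(index, (1, 3))  # layers 1..2 → bytes (100, 300)
```

A worker assigned layers 1–2 would then seek to byte 100 and read 200 bytes, never touching the rest of the file.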
How a distributed model gets deployed across the cluster
Workers can pull model layers via two mechanisms depending on network topology
Workers mount the Data Worker's /models/ directory via NFS (port 2049). The GGUF file is mmap()'d — the OS handles caching and page faults. Workers seek directly to their tensor offsets. Zero-copy, the fastest option.
For cross-site deployments, workers use HTTP Range requests to download only their assigned byte ranges. Data Worker returns partial content (HTTP 206). Slower but works over any network.
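A minimal sketch of the Range-request path (host, port, and URL are placeholders; the Data Worker's actual endpoint and scheme may differ):

```python
import urllib.request

def partial_fetch_request(url, start, end):
    """Build a GET carrying an RFC 7233 Range header asking for bytes start..end-1.
    A server that supports ranges replies 206 Partial Content with only that span."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end - 1}"})

req = partial_fetch_request(
    "http://data-worker.example:9090/models/llama-70b.gguf",  # placeholder URL
    100, 300,
)
# urllib.request.urlopen(req) would download only bytes 100..299
```

If the server ignores the header and replies 200 instead of 206, the worker would receive the whole file, so checking the response status before writing to disk is a sensible safeguard.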
Spread workers across university labs, corporate datacenters, home offices, and cloud providers. Workers register through the Edge TLS gateway over the internet. Behind NAT? The built-in tunnel needs only outbound connections — no VPN, no public IP.
Multiple Edge servers with TLS termination, API key auth, and rate limiting. Farm mode syncs state between peers. DNS or load balancer distributes external traffic.
Active-active controllers share pipeline state, worker registry, and license management. If one fails, others continue orchestrating. Up to 5 controllers in Enterprise.
Stateless routers sync worker lists from controller. Any router can serve any request. AI-powered load balancing, intent detection, theme-based routing. Up to 10 routers.
Multiple data workers with NFS cross-mounting for replication. Model files available from any data worker. Vector/RAG storage, multi-tenant isolation, database connectivity.
Any inference worker can participate in a pipeline. The controller assigns layers based on available VRAM. Workers join/leave dynamically — rebalance redistributes layers.
Science, Training, Media, Agent, Communication, IoT — every worker type supports multiple instances. Register with controller, get load-balanced automatically.
| Component | Free | Standard | Professional | Enterprise |
|---|---|---|---|---|
| Controllers | 1 | 1 | 2 | 5 |
| Routers | 1 | 2 | 4 | 10 |
| Inference Workers | 2 | 3 | 10 | 50 |
| Edge Servers | 1 | 2 | 5 | Unlimited |
| Data Workers | 1 | 2 | 5 | Unlimited |
| Science Workers | 1 | 2 | 5 | Unlimited |
| Training Workers | 1 | 2 | 5 | Unlimited |
| Pipeline Inference | ✗ | ✓ | ✓ | ✓ |
| HA Failover | ✗ | ✗ | ✓ | ✓ |
Choose the right approach for your infrastructure
| Aspect | Option A: vLLM Tensor Parallel | Option B: llama.cpp RPC | Option C: Eldric Native |
|---|---|---|---|
| Eldric Code Changes | ✓ None | ✓ None | New subsystem |
| Distribution Method | Tensor slices (all-to-all) | Layer offloading (pipeline) | Layer pipeline + NFS |
| Model Format | safetensors (HuggingFace) | GGUF | GGUF + safetensors |
| Hardware | CUDA GPUs required | Any (CPU, GPU, Metal) | Any (CPU, GPU, Metal) |
| Network Needs | NVLink/InfiniBand (fast interconnect) | Standard TCP/1Gbps | NFS + TCP |
| Model Storage | Each node downloads full model | Head needs full GGUF | Data Worker only (shared) |
| Cluster Awareness | ✗ External (Ray) | ✗ Manual setup | ✓ Full integration |
| Dashboard | ✗ Separate | ✗ None | ✓ Pipeline view |
| Auto-rebalance | ✗ | ✗ | ✓ Worker join/leave |
| Full Cluster HA | ✗ Single node | ✗ Single node | ✓ Multi-Edge, Controller, Router, Data |
| Implementation | Ready today | Ready today | Phase 1 & 2 |
Phase 1 (llama.cpp RPC orchestration) is implemented. Phase 2 (native tensor engine) is planned.
Controller orchestrates llama-rpc-server and llama-server processes across workers. Data Worker serves GGUF metadata. Full subprocess lifecycle management.
Reads real GGUF files: layers, hidden_size, tensor index with byte offsets. Data Worker serves /metadata, /tensors, /pull endpoints.
VRAM-proportional or balanced layer assignment. Controller fetches metadata, queries worker GPUs, computes optimal split.
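A sketch of what a VRAM-proportional split might compute (the controller's actual algorithm is not shown here; this is an illustrative stand-in):

```python
def assign_layers(total_layers, worker_vram_gb):
    """Split a contiguous layer range across workers, proportional to free VRAM.
    Returns [(first_layer, last_layer_exclusive), ...], one shard per worker."""
    total = sum(worker_vram_gb)
    counts = [total_layers * v // total for v in worker_vram_gb]
    # Hand any remainder layers to the workers with the most VRAM.
    remainder = total_layers - sum(counts)
    for i in sorted(range(len(counts)), key=lambda i: -worker_vram_gb[i])[:remainder]:
        counts[i] += 1
    shards, start = [], 0
    for c in counts:
        shards.append((start, start + c))
        start += c
    return shards

assign_layers(80, [24, 24, 16])  # → [(0, 30), (30, 60), (60, 80)]
```

An 80-layer model on two 24 GB workers and one 16 GB worker ends up as contiguous shards of 30, 30, and 20 layers, matching the pipeline's head/middle/tail layout.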
Workers fork/exec llama-rpc-server (middle/tail) or llama-server --rpc (head). Full process lifecycle: start, health check, graceful SIGTERM, SIGKILL fallback.
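The graceful-then-forced shutdown pattern described above, sketched with a stand-in process (`sleep` here rather than an actual llama-rpc-server):

```python
import signal
import subprocess

def stop_gracefully(proc, timeout=5.0):
    """Send SIGTERM first; escalate to SIGKILL if the process doesn't exit in time."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL fallback
        proc.wait()
    return proc.returncode

# Stand-in for a worker's llama-rpc-server subprocess:
p = subprocess.Popen(["sleep", "60"])
rc = stop_gracefully(p, timeout=2.0)  # on POSIX, -15 means "terminated by SIGTERM"
```

A negative return code on POSIX encodes the terminating signal, which is how the supervisor can tell a clean SIGTERM exit apart from a forced SIGKILL.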
Deployment starts the RPC workers first, then the head. Rebalance unloads all shards, re-queries VRAM, recomputes assignments, and redeploys. Undeploy stops all processes.
Heartbeat reports pipeline shards. Router auto-discovers pipeline models from head workers. Controller tracks shard status across heartbeats.
Replace llama.cpp internals with native tensor transport and partial model loading. Full control over the inference pipeline. Support for safetensors, dynamic rebalancing, and RDMA.
TCP + optional RDMA for hidden state transfer
Load individual layer tensors from GGUF via NFS mmap
Layer-by-layer inference with KV cache management
Support HuggingFace format, InfiniBand for datacenter
Implementation status of each pipeline component
Full type system: GGUFParser (reads real GGUF files), PipelineCoordinator, ShardAssignment, all enums + JSON serialization.
Deploy, undeploy, rebalance, status endpoints. Fetches GGUF metadata, computes shard assignments, pushes to workers, tracks via heartbeat.
GGUF metadata parser serving /metadata, /tensors, /pull. Reads real files from storage. Layer byte-range calculation for HTTP pull.
fork/exec llama-rpc-server (middle/tail) and llama-server --rpc (head). Process health checks, graceful shutdown, log files.
Auto-discovers pipeline models from head worker heartbeats. Adds distributed models to routing table. Standard load-balancing applies.
Phase 2: TCP/RDMA hidden state streaming, partial ggml loading, KV cache management. Replaces llama.cpp RPC with Eldric-native engine.
| Component | Endpoint | Method | Description |
|---|---|---|---|
| Data Worker | /api/v1/models/registry | GET | List models with metadata (layers, size, format) |
| Data Worker | /api/v1/models/{id}/metadata | GET | GGUF header: layer count, hidden dim, tensor map |
| Data Worker | /api/v1/models/{id}/tensors | GET | List all tensor names, offsets, sizes |
| Data Worker | /api/v1/models/{id}/pull | POST | Pull specific layers (byte ranges) |
| Controller | /api/v1/pipeline/models | GET | List distributed models |
| Controller | /api/v1/pipeline/deploy | POST | Deploy model across workers |
| Controller | /api/v1/pipeline/undeploy | POST | Remove distributed model |
| Controller | /api/v1/pipeline/status | GET | Shard health, latency, layer assignments |
| Controller | /api/v1/pipeline/rebalance | POST | Redistribute layers across workers |
| Worker | /api/v1/pipeline/load | POST | Load assigned layer shard |
| Worker | /api/v1/pipeline/unload | POST | Unload shard, free VRAM |
| Worker | /api/v1/pipeline/status | GET | Shard status, loaded layers, memory |
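For example, deploying a model through the Controller API might look like the following (host, port, and payload field names are illustrative assumptions, not taken from the API spec):

```python
import json
import urllib.request

payload = json.dumps({
    "model": "llama-70b-q4",          # hypothetical model id
    "strategy": "vram_proportional",  # hypothetical field name
}).encode()

req = urllib.request.Request(
    "http://controller.example:8080/api/v1/pipeline/deploy",  # placeholder host
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would start the deployment;
# GET /api/v1/pipeline/status afterwards to watch shard health.
```

Pairing the deploy call with a status poll mirrors the flow the table describes: the Controller computes shard assignments, pushes them to workers, and reports shard health back over the same API.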
The critical path — how intermediate tensors flow between pipeline stages
Split any model across your cluster with one API call. Every component is clusterable for full high availability.