Distributed Inference

Split One LLM Across Multiple Workers

Run models larger than any single node's memory by distributing layers across your Eldric cluster. Every component — Edge, Controller, Router, Worker, Data Worker — is clusterable for full high availability.

The Problem

A 70B-parameter model needs ~40 GB in Q4 quantization. Your workers have 16–24 GB of VRAM each — none can hold the model alone.

Llama 3.1 70B (Q4_K_M) — 80 layers · hidden_size 8192 · 40 GB total ✗ does not fit on any single worker

Worker 1 — 24 GB VRAM
Worker 2 — 16 GB VRAM
Worker 3 — 24 GB VRAM

Combined: 64 GB VRAM — plenty of room if we split the model
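The sizing check above can be sketched in a few lines (illustrative Python, not Eldric code; sizes mirror the example):

```python
# No single worker fits the 40 GB model, but the cluster combined does.
MODEL_GB = 40                                   # Llama 3.1 70B at Q4_K_M
workers = {"worker1": 24, "worker2": 16, "worker3": 24}

fits_single = any(vram >= MODEL_GB for vram in workers.values())
fits_cluster = sum(workers.values()) >= MODEL_GB
print(fits_single, fits_cluster)  # False True: 64 GB combined, no single fit
```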

Three Approaches

From zero-change integrations to native Eldric pipeline parallelism

Option A

vLLM Tensor Parallelism

Use vLLM's built-in tensor parallelism with Ray. Model slices split across GPUs. Eldric worker proxies to the vLLM cluster. Zero Eldric code changes.

Ready today · Needs CUDA + fast interconnect
Option B
🔗

llama.cpp RPC Distribution

Use llama.cpp's native RPC to split GGUF layers across workers. Head node coordinates, RPC servers on each worker hold their layers. Works with existing Ollama models.

Ready today · CPU + GPU, any hardware
Option C
🚀

Native Eldric Pipeline

Eldric-native distributed inference. Data Worker stores models, Controller assigns layers, Workers pull only their shards. Full cluster integration.

Prototype · Phased implementation
Option A

vLLM Tensor Parallelism

Every GPU holds a slice of every layer. All-to-all communication is required for every generated token.

Router :8881 → Eldric Worker :8890 (proxy to vLLM) → vLLM Head :8000
  Ray Cluster (tensor parallel), all-to-all sync on every token:
    GPU 0 (24 GB): slice of ALL layers
    GPU 1 (16 GB): slice of ALL layers
    GPU 2 (24 GB): slice of ALL layers
# On worker1 (head) — runs vLLM with tensor parallelism
# Note: vLLM requires the model's attention-head count to be divisible
# by --tensor-parallel-size; verify a 3-way split is valid for your model.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 3 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 --port 8000

# On worker2 & worker3 — join the Ray cluster
ray start --address=worker1:6379

# One Eldric worker proxies to vLLM
./eldric-workerd --backend vllm --backend-url http://worker1:8000 --port 8890
💡 Best for: CUDA GPU clusters with fast NVLink/InfiniBand interconnect. Every token generation requires all GPUs to synchronize, so network latency directly impacts inference speed.
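To make the interconnect requirement concrete, here is a back-of-envelope sketch of the traffic tensor parallelism must synchronize per generated token. Two all-reduces per layer is the usual Megatron-style assumption, not a measured vLLM figure:

```python
# Rough per-token synchronization volume for tensor parallelism.
# Assumed figures for Llama 3.1 70B; not measured vLLM numbers.
layers = 80
hidden = 8192
bytes_per_elem = 2               # float16 activations
allreduces_per_layer = 2         # one after attention, one after the MLP

per_token_bytes = layers * allreduces_per_layer * hidden * bytes_per_elem
print(per_token_bytes // 1024)   # 2560 KiB synchronized per token

# On a 1 Gbps link that volume alone takes ~21 ms per token; over
# NVLink-class interconnects it is microseconds.
ms_at_1gbps = per_token_bytes * 8 / 1e9 * 1000
print(round(ms_at_1gbps, 1))     # 21.0
```

This is why Option A wants NVLink or InfiniBand, while the layer-pipeline options below move only one hidden state per hop.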
Option B

llama.cpp RPC Distribution

Pipeline by layers. Each worker holds a contiguous range. Data flows sequentially.

Router :8881 → Eldric Worker :8890 → llama-server (head) :8080 with --rpc worker2:50052,worker3:50052
  Layer offload via RPC:
    llama-rpc-server :50052 on Worker 2
    llama-rpc-server :50052 on Worker 3
  Layer distribution:
    Head:  layers 0–26 + embed
    RPC 1: layers 27–52
    RPC 2: layers 53–79 + output
# On worker2 & worker3 — run RPC layer servers
./llama-rpc-server --host 0.0.0.0 --port 50052

# On worker1 (head) — splits the model across RPC workers
./llama-server \
  --model /models/llama-3.1-70B-Q4_K_M.gguf \
  --rpc worker2:50052,worker3:50052 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

# Eldric worker proxies to llama-server
./eldric-workerd --backend llama_cpp --backend-url http://worker1:8080 --port 8890
Best for: Heterogeneous hardware (mixed GPU/CPU, Apple Silicon + CUDA). Works with GGUF models you already use in Ollama. No Ray dependency, no CUDA requirement. Standard TCP networking.
Option C — Native Eldric Pipeline

Full HA Cluster with Distributed Inference

Every component is clusterable. Multiple Edge servers, Controllers, Routers — and the model is split across Workers via the Data Worker model store.

External Clients (OpenWebUI, apps, curl, SDKs)
  ↓
EDGE FARM — TLS, auth, rate limiting
  Edge 1 :443 · Edge 2 :443 · Edge 3 :443 (state sync between peers)
  ↓
CONTROLLER CLUSTER (:8880 each · shared state sync)
  Controller 1 · Controller 2
  Pipeline Coordinator + Layer Assignment Engine: VRAM-aware · HA failover · heartbeat monitor
  ↓
ROUTER POOL — AI-powered load balancing
  Router 1 :8881 · Router 2 :8881 · Router 3 :8881 → route to head worker
  ↓
PIPELINE WORKERS
  Worker 1 :8890 (HEAD): tokenize → embed → layers 0–26 → forward downstream
    24 GB VRAM · embed + blk.0–blk.26
    → TCP: hidden state [8192] →
  Worker 2 :8890 (MIDDLE): receive → layers 27–52 → forward downstream
    16 GB VRAM · blk.27–blk.52
    → TCP: hidden state [8192] →
  Worker 3 :8890 (TAIL): receive → layers 53–79 → output → sample token
    24 GB VRAM · blk.53–blk.79 + output_norm + lm_head
    → response (tokens)

DATA WORKER CLUSTER — model store + NFS
  Data Worker 1 ⇄ Data Worker 2 (replication)
  📦 llama-3.1-70B-Q4_K_M.gguf — 40 GB · 80 layers
  Access: NFS :2049 (mmap & seek) · HTTP Range (WAN)
  API: /models/metadata · /tensors · /pull · vector/RAG storage
  DB: SQLite · PostgreSQL · MySQL · DB2 · multi-tenant
  Workers pull their shards: L0–26, L27–52, L53–79

Every component is clusterable: Edge Farm (HA, TLS) · Controller Cluster · Router Pool · Pipeline Head · Pipeline Middle · Data Workers (replicated)

High availability features:
  Edge: farm mode with peer sync, shared rate-limit state, automatic failover via DNS/LB
  Controller: active-active cluster, shared pipeline state, worker re-assignment on failure
  Router: stateless pool, any router serves any request, syncs worker list from controller; pipeline models auto-discovered from head workers

GGUF Files Are Layer-Addressable

Each tensor in a GGUF file has a named offset. Workers seek directly to their assigned layers — no full download needed.

llama-3.1-70B-Q4_K_M.gguf (40 GB)

  0x0           HEADER + metadata + embed (512 MB)
  0x20000000    blk.0 – blk.26 (~13.5 GB)
  0x380000000   blk.27 – blk.52 (~13 GB)
  0x6C0000000   blk.53 – blk.79 (~13.5 GB) + output (512 MB)

NFS mmap + seek:
  Worker 1: seek to 0x20000000 — loads only embed + blk.0–26 into VRAM
  Worker 2: seek to 0x380000000 — loads only blk.27–52 into VRAM
  Worker 3: seek to 0x6C0000000 — loads only blk.53–79 + output into VRAM
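A sketch of how a worker could turn a tensor index into the byte range it must read. The tensor names follow GGUF conventions; the offsets and sizes below are illustrative, not from a real file:

```python
# GGUF-style tensor index: name -> (byte offset, byte size).
# Illustrative entries only; a real index has every tensor of every layer.
tensor_index = {
    "blk.27.attn_q.weight": (0x380000000, 96 * 2**20),
    "blk.27.ffn_up.weight": (0x386000000, 224 * 2**20),
    "blk.52.ffn_down.weight": (0x6B2000000, 224 * 2**20),
}

def layer_byte_range(index, first_layer, last_layer):
    """Smallest [start, end) range covering all tensors of blk.first..blk.last."""
    spans = [
        (off, off + size)
        for name, (off, size) in index.items()
        if name.startswith("blk.")
        and first_layer <= int(name.split(".")[1]) <= last_layer
    ]
    return min(s for s, _ in spans), max(e for _, e in spans)

start, end = layer_byte_range(tensor_index, 27, 52)
print(hex(start))  # 0x380000000 — where worker 2 seeks
```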

Deployment Flow

How a distributed model gets deployed across the cluster

1. Upload model to Data Worker
     curl -X POST dataworker:8892/api/v1/storage/upload \
       -F "file=@llama-70B-Q4.gguf" -F "path=models/"

2. Request distributed deployment
     POST controller:8880/api/v1/pipeline/deploy
     { "model": "llama-70B", "workers": ["wrk-1", "wrk-2", "wrk-3"] }

3. Controller analyzes & plans
     Reads GGUF metadata from the Data Worker → 80 layers, hidden_size=8192.
     Queries worker VRAM (wrk-1=24 GB, wrk-2=16 GB, wrk-3=24 GB) → assigns layers proportionally.

4. Push shard config to workers
     POST wrk-N:8890/api/v1/pipeline/load
     { "layers": [start, end], "role": "head|middle|tail", "model_nfs": "/mnt/..." }

5. Workers pull their layers via NFS
     Each worker mmap()s the GGUF file, seeks to its assigned tensor offsets, and loads only those tensors into VRAM.

Model "llama-3.1-70B" is now available in the Router model list — requests route to the head worker transparently.
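Step 3's VRAM-proportional split can be sketched as follows. This is illustrative Python, not the Controller's code; note that a strictly proportional split gives the 16 GB middle worker fewer layers than the near-equal 27/26/27 split shown in the diagrams:

```python
# Split `total_layers` across workers proportionally to their VRAM.
def assign_layers(total_layers, worker_vram):
    total_vram = sum(worker_vram.values())
    assignments, start = {}, 0
    items = list(worker_vram.items())
    for i, (worker, vram) in enumerate(items):
        if i == len(items) - 1:
            count = total_layers - start     # last worker takes the remainder
        else:
            count = round(total_layers * vram / total_vram)
        assignments[worker] = (start, start + count - 1)  # inclusive range
        start += count
    return assignments

print(assign_layers(80, {"wrk-1": 24, "wrk-2": 16, "wrk-3": 24}))
# {'wrk-1': (0, 29), 'wrk-2': (30, 49), 'wrk-3': (50, 79)}
```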

Model Pull Strategies

Workers can pull model layers via two mechanisms depending on network topology

📁

NFS Mount (LAN)

Workers mount Data Worker's /models/ directory via NFS (:2049). The GGUF file is mmap()'d — the OS handles caching and page faults. Workers seek directly to their tensor offsets. Zero-copy, fastest option.

# Worker mounts the data worker's NFS export
mount -t nfs dataworker:/models /mnt/models

# Worker opens the GGUF and seeks to the layer 27 offset (pseudocode)
mmap("/mnt/models/llama-70B.gguf")
seek(0x380000000)  # layer 27 start
Already implemented · Same datacenter
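The mmap-and-slice pattern can be sketched with Python's stdlib (the path and offsets are illustrative, not real Eldric paths):

```python
# mmap the GGUF file and slice out one shard's bytes; only the touched
# pages are faulted in, so the rest of the 40 GB file is never read.
import mmap

def read_layer_bytes(path: str, offset: int, size: int) -> bytes:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset : offset + size]

# Worker 2 would do something like:
# shard = read_layer_bytes("/mnt/models/llama-70B.gguf", 0x380000000, 13 * 2**30)
```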
🌐

HTTP Range Requests (WAN)

For cross-site deployments, workers use HTTP Range requests to download only their assigned byte ranges. Data Worker returns partial content (HTTP 206). Slower but works over any network.

# Worker requests only layers 27-52
GET /api/v1/models/llama-70B/pull
Range: bytes=14965800960-28991029247

# Data Worker responds: HTTP 206 Partial Content
# ~13 GB of layer tensors transferred
New endpoint · Cross-datacenter
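A stdlib-only sketch of the same pull from the worker side. The endpoint shape and byte range mirror the example above; the helper names are hypothetical:

```python
# Fetch only an assigned byte range with an HTTP Range header.
import urllib.request

def range_header(start: int, end: int) -> str:
    """RFC 7233 inclusive byte range for the tensors we need."""
    return f"bytes={start}-{end}"

def pull_layer_range(url: str, start: int, end: int) -> bytes:
    req = urllib.request.Request(url, headers={"Range": range_header(start, end)})
    with urllib.request.urlopen(req) as resp:
        # 206 = Partial Content; a 200 would mean the server ignored Range
        assert resp.status == 206, "server does not support byte ranges"
        return resp.read()

print(range_header(14965800960, 28991029247))  # bytes=14965800960-28991029247
```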

Every Component Is Clusterable — Worldwide

Spread workers across university labs, corporate datacenters, home offices, and cloud providers. Workers register through the Edge TLS gateway over the internet. Behind NAT? The built-in tunnel needs only outbound connections — no VPN, no public IP.

🌐

Edge Farm

Multiple Edge servers with TLS termination, API key auth, and rate limiting. Farm mode syncs state between peers. DNS or load balancer distributes external traffic.

./eldric-edge --mode farm \
  --peers edge2:443,edge3:443
Production ready
🏛️

Controller Cluster

Active-active controllers share pipeline state, worker registry, and license management. If one fails, others continue orchestrating. Up to 5 controllers in Enterprise.

# Enterprise: up to 5 controllers
./eldric-controller --port 8880 \
  --cluster-peers ctrl2:8880
Enterprise
⚖️

Router Pool

Stateless routers sync worker lists from controller. Any router can serve any request. AI-powered load balancing, intent detection, theme-based routing. Up to 10 routers.

./eldric-routerd --controller http://ctrl:8880 \
  --sync-interval 30s
Production ready
💾

Data Worker Cluster

Multiple data workers with NFS cross-mounting for replication. Model files available from any data worker. Vector/RAG storage, multi-tenant isolation, database connectivity.

./eldric-datad --nfs --vector \
  --storage-path /data/eldric
Production ready
🧠

Pipeline Workers

Any inference worker can participate in a pipeline. The controller assigns layers based on available VRAM. Workers join/leave dynamically — rebalance redistributes layers.

# Workers auto-register; the controller assigns shards
./eldric-workerd --controller http://ctrl:8880
Up to 50 workers
🤖

All Specialized Workers

Science, Training, Media, Agent, Communication, IoT — every worker type supports multiple instances. Register with controller, get load-balanced automatically.

Science :8897 · Training :8898 · Media :8894 · Agent :8893

Cluster Limits by License

Component          | Free | Standard | Professional | Enterprise
Controllers        | 1    | 1        | 2            | 5
Routers            | 1    | 2        | 4            | 10
Inference Workers  | 2    | 3        | 10           | 50
Edge Servers       | 1    | 2        | 5            | Unlimited
Data Workers       | 1    | 2        | 5            | Unlimited
Science Workers    | 1    | 2        | 5            | Unlimited
Training Workers   | 1    | 2        | 5            | Unlimited
Pipeline Inference |      |          |              |
HA Failover        |      |          |              |

Comparison

Choose the right approach for your infrastructure

Aspect              | Option A: vLLM Tensor Parallel | Option B: llama.cpp RPC     | Option C: Eldric Native
Eldric Code Changes | None                           | None                        | New subsystem
Distribution Method | Tensor slices (all-to-all)     | Layer offloading (pipeline) | Layer pipeline + NFS
Model Format        | safetensors (HuggingFace)      | GGUF                        | GGUF + safetensors
Hardware            | CUDA GPUs required             | Any (CPU, GPU, Metal)       | Any (CPU, GPU, Metal)
Network Needs       | NVLink/InfiniBand (fast!)      | Standard TCP / 1 Gbps       | NFS + TCP
Model Storage       | Each node downloads full model | Head needs full GGUF        | Data Worker only (shared)
Cluster Awareness   | External (Ray)                 | Manual setup                | Full integration
Dashboard           | Separate                       | None                        | Pipeline view
Auto-rebalance      |                                |                             | Worker join/leave
Full Cluster HA     | Single node                    | Single node                 | Multi-Edge, Controller, Router, Data
Implementation      | Ready today                    | Ready today                 | Phase 1 & 2

Implementation Status

Phase 1 (llama.cpp RPC orchestration) is implemented. Phase 2 (native tensor engine) is planned.

Phase 1 — Implemented

Automated llama.cpp RPC Orchestration

Controller orchestrates llama-rpc-server and llama-server processes across workers. Data Worker serves GGUF metadata. Full subprocess lifecycle management.

GGUF Parser + Model Registry

Reads real GGUF files: layers, hidden_size, tensor index with byte offsets. Data Worker serves /metadata, /tensors, /pull endpoints.

Pipeline Coordinator

VRAM-proportional or balanced layer assignment. Controller fetches metadata, queries worker GPUs, computes optimal split.

Worker Process Management

Workers fork/exec llama-rpc-server (middle/tail) or llama-server --rpc (head). Full process lifecycle: start, health check, graceful SIGTERM, SIGKILL fallback.
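The SIGTERM-then-SIGKILL lifecycle can be sketched like this (a Python stand-in for the worker's process manager; `sleep 60` stands in for a llama-rpc-server process, and the grace period is illustrative):

```python
# Graceful stop: SIGTERM first, escalate to SIGKILL if the process lingers.
import signal, subprocess

def stop_gracefully(proc: subprocess.Popen, grace_s: float = 10) -> int:
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()              # SIGKILL fallback
        proc.wait()
    return proc.returncode

proc = subprocess.Popen(["sleep", "60"])   # stand-in for llama-rpc-server
rc = stop_gracefully(proc, grace_s=2)
print(rc)  # negative signal number on POSIX, e.g. -15 for SIGTERM
```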

Orchestration + Rebalancing

Deploy brings up the RPC workers first, then the head. Rebalance unloads all shards, re-queries VRAM, recomputes assignments, and redeploys. Undeploy stops all processes.

Router Integration

Heartbeat reports pipeline shards. Router auto-discovers pipeline models from head workers. Controller tracks shard status across heartbeats.

Status: Full pipeline orchestration working. Deploy a GGUF model across N workers with one API call. llama.cpp handles the inference; Eldric manages the cluster lifecycle.
Phase 2 — Planned

Native Pipeline Engine

Replace llama.cpp internals with native tensor transport and partial model loading. Full control over the inference pipeline. Support for safetensors, dynamic rebalancing, and RDMA.

1. Tensor Transport Layer: TCP + optional RDMA for hidden state transfer

2. Partial Model Loader (ggml): load individual layer tensors from GGUF via NFS mmap

3. Pipeline Forward Pass: layer-by-layer inference with KV cache management

4. safetensors + RDMA: support the HuggingFace format, InfiniBand for datacenters

🚀
Goal: Fully native distributed inference with no external dependencies. Drop a model on your cluster and Eldric shards it automatically.
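One way such a transport could frame a hidden state on the wire — an assumed format for illustration, not the Eldric protocol:

```python
# Minimal hidden-state frame: fixed header (token id, element count)
# followed by a float16 payload. Wire format is an assumption.
import struct

HDR = struct.Struct("<II")   # token_id, n_elems (little-endian)

def pack_hidden(token_id: int, fp16_payload: bytes) -> bytes:
    return HDR.pack(token_id, len(fp16_payload) // 2) + fp16_payload

def unpack_hidden(frame: bytes):
    token_id, n = HDR.unpack_from(frame)
    return token_id, frame[HDR.size : HDR.size + 2 * n]

payload = bytes(8192 * 2)            # one Llama-70B hidden state, zeroed
frame = pack_hidden(7, payload)
print(len(frame))                    # 16392: 8-byte header + 16 KiB payload
```

A length-prefixed frame like this lets the receiving stage read exactly one tensor per token from a TCP stream before running its layer group.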

Components & Status

Implementation status of each pipeline component

📦

pipeline_types.h

Full type system: GGUFParser (reads real GGUF files), PipelineCoordinator, ShardAssignment, all enums + JSON serialization.

Implemented
🧠

Controller Pipeline API

Deploy, undeploy, rebalance, status endpoints. Fetches GGUF metadata, computes shard assignments, pushes to workers, tracks via heartbeat.

Implemented
📊

Data Worker Model Registry

GGUF metadata parser serving /metadata, /tensors, /pull. Reads real files from storage. Layer byte-range calculation for HTTP pull.

Implemented
⚙️

Worker Process Manager

fork/exec llama-rpc-server (middle/tail) and llama-server --rpc (head). Process health checks, graceful shutdown, log files.

Implemented
🔗

Router Pipeline Awareness

Auto-discovers pipeline models from head worker heartbeats. Adds distributed models to routing table. Standard load-balancing applies.

Implemented
🚀

Native Tensor Transport

Phase 2: TCP/RDMA hidden state streaming, partial ggml loading, KV cache management. Replaces llama.cpp RPC with Eldric-native engine.

Phase 2 (planned)

New API Endpoints

Component   | Endpoint                     | Method | Description
Data Worker | /api/v1/models/registry      | GET    | List models with metadata (layers, size, format)
Data Worker | /api/v1/models/{id}/metadata | GET    | GGUF header: layer count, hidden dim, tensor map
Data Worker | /api/v1/models/{id}/tensors  | GET    | List all tensor names, offsets, sizes
Data Worker | /api/v1/models/{id}/pull     | POST   | Pull specific layers (byte ranges)
Controller  | /api/v1/pipeline/models      | GET    | List distributed models
Controller  | /api/v1/pipeline/deploy      | POST   | Deploy model across workers
Controller  | /api/v1/pipeline/undeploy    | POST   | Remove distributed model
Controller  | /api/v1/pipeline/status      | GET    | Shard health, latency, layer assignments
Controller  | /api/v1/pipeline/rebalance   | POST   | Redistribute layers across workers
Worker      | /api/v1/pipeline/load        | POST   | Load assigned layer shard
Worker      | /api/v1/pipeline/unload      | POST   | Unload shard, free VRAM
Worker      | /api/v1/pipeline/status      | GET    | Shard status, loaded layers, memory

Hidden State Transfer

The critical path — how intermediate tensors flow between pipeline stages

Token generation timeline:
  Worker 1 (embed + L0–26): tokenize → embed → 27 attn+ffn layers
    → TCP, 16 KB →
  Worker 2 (L27–52): 26 attn+ffn layers
    → TCP, 16 KB →
  Worker 3 (L53–79 + output): 27 layers + norm + lm_head + sample token

Hidden state tensor: float16[8192] = 16,384 bytes per token (Llama 70B hidden_size = 8192)

Transfer cost per token: 16 KB × 2 hops = 32 KB.
At 1 Gbps LAN: ~0.13 ms per hop (~0.26 ms total) — negligible vs compute time (~10–50 ms per layer group).
💡 Key insight: The hidden state is tiny (16 KB per token) compared to the compute time per layer group. On a standard 1 Gbps LAN, the network overhead adds less than 1 ms per pipeline stage. The bottleneck is compute, not transfer — which means pipeline parallelism works well even on commodity networks.
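Spelling out the arithmetic behind the key insight (all numbers from the text above):

```python
# Per-token hidden-state traffic for the 3-stage pipeline.
hidden, fp16_bytes = 8192, 2
hops = 2                                      # head->middle, middle->tail

bytes_per_hop = hidden * fp16_bytes           # one hidden state per token
ms_per_hop = bytes_per_hop * 8 / 1e9 * 1000   # serialization time at 1 Gbps
print(bytes_per_hop)                          # 16384 bytes = 16 KiB
print(round(ms_per_hop, 2))                   # 0.13 ms per hop
print(round(hops * ms_per_hop, 2))            # 0.26 ms per token total
```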

Summary

Split any model across your cluster with one API call. Every component is clusterable for full high availability.

Options A & B
vLLM tensor parallel
llama.cpp RPC — ready today
Phase 1 — Done
Automated RPC orchestration
GGUF parser + subprocess mgmt
🚀
Phase 2 — Planned
Native tensor engine
ggml + RDMA + safetensors