Port 8883 · Native Model Loading

Native Inference Worker

Load GGUF and xLSTM models directly into memory without external backends. Zero-dependency inference with embedded llama.cpp, clusterable deployment, swarm integration, and pipeline-parallel model splitting across nodes.

GGUF + xLSTM Model Formats · 0 External Dependencies · SSE Streaming · Pipeline Parallel

Architecture

No Ollama. No vLLM. Direct model loading with embedded llama.cpp and xLSTM runtimes.

[Architecture diagram: a Controller (:8880) manages three Inference Workers (:8883; IDs inf-abc123, inf-def456, inf-ghi789), each running embedded llama.cpp and xLSTM runtimes and serving models such as llama3.2-3b.gguf, qwen2.5-7b.gguf, and mistral-7b.gguf. All workers register with the controller and with the Swarm Controller (:8885), which delegates inference tasks; peers communicate directly for pipeline parallelism.]

Key Features

Everything you need for production-grade native inference, from single-node to distributed clusters.

Zero External Dependencies

llama.cpp is embedded directly into the binary. No Ollama, no vLLM, no external processes. Just drop GGUF files in a directory and serve them.

📚 GGUF + xLSTM Models

Serve quantized GGUF models (Q2 through Q8, FP16) via llama.cpp and xLSTM models via the NXAI runtime. Automatic format detection.

🌐 OpenAI-Compatible API

Standard /v1/chat/completions with streaming SSE. Drop-in replacement for any OpenAI-compatible client or framework.

🔗 Pipeline Parallelism

Split large models across multiple inference workers. Each node handles a subset of layers, connected via high-speed peer-to-peer communication.

🤖 Swarm Integration

Register with the Swarm controller for multi-agent orchestration. Agents can delegate inference tasks to the nearest available native worker.

📈 Cluster Ready

Register with the controller, get discovered by routers, appear in dashboards. Full heartbeat, health monitoring, and license enforcement.

🕒 Permanent Model Residence

Models stay loaded in VRAM permanently. Zero cold-start latency for every request. No model swapping, no load/unload overhead.

📦 Model Distribution

Automatically fetch models from Data Worker storage or the Controller's model registry. No manual file copying across nodes.
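A concrete way to exercise the OpenAI-compatible endpoint described above, using only the Python standard library. This is a minimal sketch: the helper names (`chat_payload`, `chat_completion`) and the localhost URL are illustrative, not part of the worker itself.

```python
import json
import urllib.request

def chat_payload(model, messages, stream=False):
    """Build a standard OpenAI-style chat completion request body."""
    return {"model": model, "messages": messages, "stream": stream}

def chat_completion(base_url, model, messages):
    """POST a non-streaming chat completion to an inference worker."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running worker on :8883):
# reply = chat_completion("http://localhost:8883", "llama3.2-3b-q8.gguf",
#                         [{"role": "user", "content": "Hello!"}])
# print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, any OpenAI SDK pointed at the worker's base URL should behave the same way.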

Native vs. Proxy Workers

The inference worker loads models directly. No middleman.

[Comparison diagram. Traditional proxy worker: eldric-workerd (:8890) HTTP-proxies to Ollama / vLLM / TGI, which loads the GGUF model; 3 processes, 2 hops, one external dependency. Native inference worker: eldric-inferenced (:8883) embeds llama.cpp + xLSTM, serves the OpenAI API and dashboard, and mmaps the GGUF / xLSTM model directly; 1 process, 0 hops, zero dependencies.]

API Reference

OpenAI-compatible endpoints plus cluster management APIs.

Inference Endpoints

Method  Endpoint                  Description
GET     /health                   Worker health with GPU info
GET     /dashboard                Web management dashboard
GET     /v1/models                List loaded models (OpenAI format)
POST    /v1/chat/completions      OpenAI-compatible chat (streaming SSE)
POST    /v1/embeddings            Generate embeddings
POST    /api/v1/models/load       Load a model into VRAM
POST    /api/v1/models/unload     Unload a model from VRAM
GET     /api/v1/models/loaded     List currently loaded models with stats
GET     /api/v1/gpu               GPU information (VRAM, utilization)
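With `"stream": true`, `/v1/chat/completions` emits Server-Sent Events. Below is a minimal parser sketch, assuming the usual OpenAI chunk shape (content under `choices[0].delta.content`) and a `data: [DONE]` terminator; `parse_sse_chunks` is an illustrative helper, not part of the worker.

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from an OpenAI-style SSE chat stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, blank keep-alive lines, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

Feeding it the raw lines of a streaming response reassembles the completion token by token.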

Swarm & Cluster Endpoints

Method  Endpoint                  Description
POST    /api/v1/swarm/task        Receive delegated task from Swarm
GET     /api/v1/swarm/status      Report status to Swarm controller
POST    /api/v1/peers/register    Register a peer for pipeline parallelism
GET     /api/v1/peers             List connected peers
POST    /api/v1/heartbeat         Internal heartbeat to controller
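The peer-registration request body is not specified here, so the sketch below assumes a simple JSON object carrying the peer's URL; `peer_payload`, `register_peer`, and the `peer` field name are assumptions, not documented API.

```python
import json
import urllib.request

def peer_payload(peer_url):
    """Assumed body for POST /api/v1/peers/register (field name is a guess)."""
    return {"peer": peer_url}

def register_peer(worker_url, peer_url):
    """Register a peer worker for pipeline parallelism (illustrative sketch)."""
    req = urllib.request.Request(
        f"{worker_url}/api/v1/peers/register",
        data=json.dumps(peer_payload(peer_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```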

Quick Start

Get native inference running in under a minute.

# Start standalone (simplest)
./eldric-inferenced --model-dir /var/lib/eldric/models --gpu-layers -1

# Join a cluster with Data Worker model distribution
./eldric-inferenced --model-dir /var/lib/eldric/models \
  --controller http://controller:8880 \
  --data-workers http://dataworker:8892 \
  --gpu-layers -1

# Pipeline parallelism with peers
./eldric-inferenced --enable-pipeline --peers http://inf2:8883,http://inf3:8883

# With swarm integration
./eldric-inferenced --controller http://controller:8880 --swarm-url http://swarm:8885

# Chat completion
curl -X POST http://localhost:8883/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b-q8.gguf","messages":[{"role":"user","content":"Hello!"}]}'

Configuration Flags

Flag                 Description
--model-dir          Directory for GGUF model files
--controller         Controller URL for registration and the model registry
--gpu-layers         Number of layers to offload to GPU (-1 = all)
--data-workers       Data Worker URLs for model distribution
--swarm-url          Swarm controller URL for agent coordination
--enable-pipeline    Enable pipeline parallelism mode
--peers              Peer inference worker URLs for pipeline parallelism
--preload            Models to load at startup
--extra-model-dir    Additional model search paths

Model Auto-Discovery & Distribution

Place .gguf, .xlstm, or .safetensors files in the model directory. The inference worker scans and discovers them automatically. When connected to a Controller, models are fetched from the model registry. With --data-workers, models can be automatically downloaded from Data Worker storage.
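The scan described above amounts to matching files by extension. A sketch of that rule, assuming a recursive scan; `discover_models` is an illustrative helper, not the worker's actual implementation:

```python
from pathlib import Path

# Extensions the worker recognizes as model files
MODEL_EXTENSIONS = {".gguf", ".xlstm", ".safetensors"}

def discover_models(model_dir):
    """Return all model files under model_dir, matched by extension."""
    return sorted(
        p for p in Path(model_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS
    )
```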

Pipeline Parallelism

Split large models across multiple inference workers. Each handles a subset of layers.

[Pipeline diagram: Llama 3.1 70B split across 3 inference workers. Worker 1 (24 GB VRAM, head node) holds layers 0-26, Worker 2 (16 GB VRAM, middle node) holds layers 27-53, Worker 3 (24 GB VRAM, tail node) holds layers 54-79; activations flow from node to node. Combined 64 GB VRAM: run 70B models that don't fit on a single GPU.]
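The layer ranges above are a near-even split of Llama 3.1 70B's 80 transformer layers across 3 workers. A sketch of such a contiguous partition (illustrative; the worker's real assignment policy, e.g. VRAM-weighted, may differ):

```python
def partition_layers(n_layers, n_workers):
    """Split n_layers into n_workers contiguous, near-even (start, end) ranges."""
    base, rem = divmod(n_layers, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < rem else 0)  # early workers absorb the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# partition_layers(80, 3) reproduces the split shown above:
# [(0, 26), (27, 53), (54, 79)]
```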

Comparison with Other Backends

How inferenced stacks up against Ollama and vLLM.

Feature                  inferenced            Ollama             vLLM
GGUF native
Cold start               None (preloaded)      Model load time    Model load time
GPU memory overhead      Minimal               Ollama runtime     Python runtime
Continuous batching      Limited
Model distribution       Data Worker           Ollama Hub         HuggingFace
xLSTM support
Pipeline parallelism
External dependencies    None                  Ollama process     Python + PyTorch
Cluster integration      Controller + Swarm    Standalone         Standalone
GPU support              CUDA + Metal                             CUDA only

License Tiers

Scale from free tier to unlimited enterprise deployments.

Feature                    Free    Standard    Professional    Enterprise
Inference workers          1       3           10              Unlimited
GGUF models
xLSTM models
Streaming SSE
Controller registration
Swarm integration
Pipeline parallelism
Model distribution
Multi-GPU
Training integration

Need More Capacity?

Contact license@core.at for custom licensing with unlimited inference workers, pipeline parallelism, and priority support.

Get Started

Download the Eldric distributed package and start serving models natively on your own infrastructure.

Download Eldric View Licensing