Port 8883 · Native Model Loading

Native Inference Worker

Load GGUF and xLSTM models directly into memory without external backends. Zero-dependency inference with embedded llama.cpp, clusterable deployment, swarm integration, and pipeline-parallel model splitting across nodes.

GGUF + xLSTM Model Formats · 0 External Dependencies · SSE Streaming · Pipeline Parallel

Architecture

No Ollama. No vLLM. Direct model loading with embedded llama.cpp and xLSTM runtimes.

[Architecture diagram: a Controller (:8880) manages three Inference Workers (:8883; IDs inf-abc123, inf-def456, inf-ghi789), each running embedded llama.cpp and xLSTM runtimes and serving models such as llama3.2-3b.gguf, qwen2.5-7b.gguf, and mistral-7b.gguf. All workers register with the controller and with the Swarm Controller (:8885), which delegates inference tasks; peers communicate directly for pipeline parallelism.]

Key Features

Everything you need for production-grade native inference, from single-node to distributed clusters.

Zero External Dependencies

llama.cpp is embedded directly into the binary. No Ollama, no vLLM, no external processes. Just drop GGUF files in a directory and serve them.

📚 GGUF + xLSTM Models

Serve quantized GGUF models (Q2 through Q8, FP16) via llama.cpp and xLSTM models via the NXAI runtime. Automatic format detection.

🌐 OpenAI-Compatible API

Standard /v1/chat/completions with streaming SSE. Drop-in replacement for any OpenAI-compatible client or framework.

🔗 Pipeline Parallelism

Split large models across multiple inference workers. Each node handles a subset of layers, connected via high-speed peer-to-peer communication.

🤖 Swarm Integration

Register with the Swarm controller for multi-agent orchestration. Agents can delegate inference tasks to the nearest available native worker.

📈 Cluster Ready

Register with the controller, get discovered by routers, appear in dashboards. Full heartbeat, health monitoring, and license enforcement.

🕒 Permanent Model Residence

Models stay loaded in VRAM permanently. Zero cold-start latency for every request. No model swapping, no load/unload overhead.

📦 Model Distribution

Automatically fetch models from Data Worker storage or the Controller's model registry. No manual file copying across nodes.
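A concrete way to exercise the OpenAI-compatible endpoint described above, using only the Python standard library. This is a minimal sketch: the helper names (`chat_payload`, `chat_completion`) and the localhost URL are illustrative, not part of the worker itself.

```python
import json
import urllib.request

def chat_payload(model, messages, stream=False):
    """Build a standard OpenAI-style chat completion request body."""
    return {"model": model, "messages": messages, "stream": stream}

def chat_completion(base_url, model, messages):
    """POST a non-streaming chat completion to an inference worker."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running worker on :8883):
# reply = chat_completion("http://localhost:8883", "llama3.2-3b-q8.gguf",
#                         [{"role": "user", "content": "Hello!"}])
# print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, any OpenAI SDK pointed at the worker's base URL should behave the same way.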

Native vs. Proxy Workers

The inference worker loads models directly. No middleman.

[Comparison diagram. Traditional proxy worker: eldric-workerd (:8890) HTTP-proxies to Ollama / vLLM / TGI, which loads the GGUF model; 3 processes, 2 hops, one external dependency. Native inference worker: eldric-inferenced (:8883) embeds llama.cpp + xLSTM, serves the OpenAI API and dashboard, and mmaps the GGUF / xLSTM model directly; 1 process, 0 hops, zero dependencies.]

API Reference

OpenAI-compatible endpoints plus cluster management APIs.

Inference Endpoints

Method  Endpoint                  Description
GET     /health                   Worker health with GPU info
GET     /dashboard                Web management dashboard
GET     /v1/models                List loaded models (OpenAI format)
POST    /v1/chat/completions      OpenAI-compatible chat (streaming SSE)
POST    /v1/embeddings            Generate embeddings
POST    /api/v1/models/load       Load a model into VRAM
POST    /api/v1/models/unload     Unload a model from VRAM
GET     /api/v1/models/loaded     List currently loaded models with stats
GET     /api/v1/gpu               GPU information (VRAM, utilization)
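With `"stream": true`, `/v1/chat/completions` emits Server-Sent Events. Below is a minimal parser sketch, assuming the usual OpenAI chunk shape (content under `choices[0].delta.content`) and a `data: [DONE]` terminator; `parse_sse_chunks` is an illustrative helper, not part of the worker.

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from an OpenAI-style SSE chat stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, blank keep-alive lines, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]
```

Feeding it the raw lines of a streaming response reassembles the completion token by token.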

Swarm & Cluster Endpoints

Method  Endpoint                  Description
POST    /api/v1/swarm/task        Receive delegated task from Swarm
GET     /api/v1/swarm/status      Report status to Swarm controller
POST    /api/v1/peers/register    Register a peer for pipeline parallelism
GET     /api/v1/peers             List connected peers
POST    /api/v1/heartbeat         Internal heartbeat to controller
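The peer-registration request body is not specified here, so the sketch below assumes a simple JSON object carrying the peer's URL; `peer_payload`, `register_peer`, and the `peer` field name are assumptions, not documented API.

```python
import json
import urllib.request

def peer_payload(peer_url):
    """Assumed body for POST /api/v1/peers/register (field name is a guess)."""
    return {"peer": peer_url}

def register_peer(worker_url, peer_url):
    """Register a peer worker for pipeline parallelism (illustrative sketch)."""
    req = urllib.request.Request(
        f"{worker_url}/api/v1/peers/register",
        data=json.dumps(peer_payload(peer_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```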

Quick Start

Get native inference running in under a minute.

# Start standalone (simplest)
./eldric-inferenced --model-dir /var/lib/eldric/models --gpu-layers -1

# Join a cluster with Data Worker model distribution
./eldric-inferenced --model-dir /var/lib/eldric/models \
  --controller http://controller:8880 \
  --data-workers http://dataworker:8892 \
  --gpu-layers -1

# Pipeline parallelism with peers
./eldric-inferenced --enable-pipeline --peers http://inf2:8883,http://inf3:8883

# With swarm integration
./eldric-inferenced --controller http://controller:8880 --swarm-url http://swarm:8885

# Chat completion
curl -X POST http://localhost:8883/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b-q8.gguf","messages":[{"role":"user","content":"Hello!"}]}'

Configuration Flags

Flag                 Description
--model-dir          Directory for GGUF model files
--controller         Controller URL for registration and the model registry
--gpu-layers         Number of layers to offload to GPU (-1 = all)
--data-workers       Data Worker URLs for model distribution
--swarm-url          Swarm controller URL for agent coordination
--enable-pipeline    Enable pipeline parallelism mode
--peers              Peer inference worker URLs for pipeline parallelism
--preload            Models to load at startup
--extra-model-dir    Additional model search paths

Model Auto-Discovery & Distribution

Place .gguf, .xlstm, or .safetensors files in the model directory. The inference worker scans and discovers them automatically. When connected to a Controller, models are fetched from the model registry. With --data-workers, models can be automatically downloaded from Data Worker storage.
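The scan described above amounts to matching files by extension. A sketch of that rule, assuming a recursive scan; `discover_models` is an illustrative helper, not the worker's actual implementation:

```python
from pathlib import Path

# Extensions the worker recognizes as model files
MODEL_EXTENSIONS = {".gguf", ".xlstm", ".safetensors"}

def discover_models(model_dir):
    """Return all model files under model_dir, matched by extension."""
    return sorted(
        p for p in Path(model_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS
    )
```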

Pipeline Parallelism

Split large models across multiple inference workers. Each handles a subset of layers.

[Pipeline diagram: Llama 3.1 70B split across 3 inference workers. Worker 1 (24 GB VRAM, head node) holds layers 0-26, Worker 2 (16 GB VRAM, middle node) holds layers 27-53, Worker 3 (24 GB VRAM, tail node) holds layers 54-79; activations flow from node to node. Combined 64 GB VRAM: run 70B models that don't fit on a single GPU.]
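The layer ranges above are a near-even split of Llama 3.1 70B's 80 transformer layers across 3 workers. A sketch of such a contiguous partition (illustrative; the worker's real assignment policy, e.g. VRAM-weighted, may differ):

```python
def partition_layers(n_layers, n_workers):
    """Split n_layers into n_workers contiguous, near-even (start, end) ranges."""
    base, rem = divmod(n_layers, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < rem else 0)  # early workers absorb the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# partition_layers(80, 3) reproduces the split shown above:
# [(0, 26), (27, 53), (54, 79)]
```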

Comparison with Other Backends

How inferenced stacks up against Ollama and vLLM.

Feature                  inferenced            Ollama             vLLM
GGUF native
Cold start               None (preloaded)      Model load time    Model load time
GPU memory overhead      Minimal               Ollama runtime     Python runtime
Continuous batching      Limited
Model distribution       Data Worker           Ollama Hub         HuggingFace
xLSTM support
Pipeline parallelism
External dependencies    None                  Ollama process     Python + PyTorch
Cluster integration      Controller + Swarm    Standalone         Standalone
GPU support              CUDA + Metal                             CUDA only

License Tiers

Scale from free tier to unlimited enterprise deployments.

Feature                    Free    Standard    Professional    Enterprise
Inference workers          1       3           10              Unlimited
GGUF models
xLSTM models
Streaming SSE
Controller registration
Swarm integration
Pipeline parallelism
Model distribution
Multi-GPU
Training integration

Need More Capacity?

Contact license@core.at for custom licensing with unlimited inference workers, pipeline parallelism, and priority support.

Get Started

Download the Eldric distributed package and start serving models natively on your own infrastructure.

Download Eldric View Licensing