Load GGUF and xLSTM models directly into memory without external backends. Zero-dependency inference with embedded llama.cpp, clusterable deployment, swarm integration, and pipeline-parallel model splitting across nodes.
No Ollama. No vLLM. Direct model loading with embedded llama.cpp and xLSTM runtimes.
Everything you need for production-grade native inference, from single-node to distributed clusters.
llama.cpp is embedded directly into the binary. No Ollama, no vLLM, no external processes. Just drop GGUF files in a directory and serve them.
Serve quantized GGUF models (Q2 through Q8, FP16) via llama.cpp and xLSTM models via the NXAI runtime. Automatic format detection.
Standard /v1/chat/completions with streaming SSE. Drop-in replacement for any OpenAI-compatible client or framework.
Split large models across multiple inference workers. Each node handles a subset of layers, connected via high-speed peer-to-peer communication.
Register with the Swarm controller for multi-agent orchestration. Agents can delegate inference tasks to the nearest available native worker.
Register with the controller, get discovered by routers, appear in dashboards. Full heartbeat, health monitoring, and license enforcement.
Models stay loaded in VRAM permanently. Zero cold-start latency for every request. No model swapping, no load/unload overhead.
Automatically fetch models from Data Worker storage or the Controller registry. No manual file copying across nodes.
The inference worker loads models directly. No middleman.
OpenAI-compatible endpoints plus cluster management APIs.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Worker health with GPU info |
| GET | /dashboard | Web management dashboard |
| GET | /v1/models | List loaded models (OpenAI format) |
| POST | /v1/chat/completions | OpenAI-compatible chat (streaming SSE) |
| POST | /v1/embeddings | Generate embeddings |
| POST | /api/v1/models/load | Load a model into VRAM |
| POST | /api/v1/models/unload | Unload a model from VRAM |
| GET | /api/v1/models/loaded | List currently loaded models with stats |
| GET | /api/v1/gpu | GPU information (VRAM, utilization) |
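A minimal request against the chat endpoint above might look like the following sketch. The host and port (`localhost:8080`) and the model name are assumptions; the payload follows the standard OpenAI chat-completions shape, with `-N` disabling curl's buffering so SSE chunks arrive as they are generated.

```shell
# Build the request body (model name "llama-3-8b-q4" is illustrative)
cat > /tmp/chat.json <<'EOF'
{
  "model": "llama-3-8b-q4",
  "messages": [{"role": "user", "content": "Say hello."}],
  "stream": true
}
EOF
# Stream the response; assumes a worker listening on localhost:8080
curl -sN -H 'Content-Type: application/json' \
  -d @/tmp/chat.json http://localhost:8080/v1/chat/completions || true
```

Because the endpoint is OpenAI-compatible, any OpenAI SDK pointed at the worker's base URL should work the same way.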
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/swarm/task | Receive delegated task from Swarm |
| GET | /api/v1/swarm/status | Report status to Swarm controller |
| POST | /api/v1/peers/register | Register a peer for pipeline parallelism |
| GET | /api/v1/peers | List connected peers |
| POST | /api/v1/heartbeat | Internal heartbeat to controller |
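For the cluster-management side, loading a model into VRAM could be sketched as below. The host, port, and request field names (`model`, `gpu_layers`) are assumptions, not documented parameters; only the endpoint path comes from the table above.

```shell
# Hypothetical load request; field names are illustrative assumptions
cat > /tmp/load.json <<'EOF'
{"model": "llama-3-8b-q4", "gpu_layers": -1}
EOF
# Assumes a worker listening on localhost:8080
curl -s -H 'Content-Type: application/json' \
  -d @/tmp/load.json http://localhost:8080/api/v1/models/load || true
```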
Get native inference running in under a minute.
| Flag | Description |
|---|---|
| --model-dir | Directory for GGUF model files |
| --controller | Controller URL for registration and model registry |
| --gpu-layers | Number of layers to offload to GPU (-1 = all) |
| --data-workers | Data worker URLs for model distribution |
| --swarm-url | Swarm controller for agent coordination |
| --enable-pipeline | Enable pipeline parallelism mode |
| --peers | Peer inference worker URLs for pipeline |
| --preload | Models to load at startup |
| --extra-model-dir | Additional model search paths |
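Putting the flags together, a single-node launch might look like this sketch. The binary name `inferenced` is an assumption based on the product name, as are the paths, URLs, and model name; only the flags themselves come from the table above.

```shell
# Hypothetical single-node launch; names and paths are illustrative
inferenced \
  --model-dir /srv/models \
  --controller http://controller.example:8080 \
  --gpu-layers -1 \
  --preload llama-3-8b-q4
```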
Place .gguf, .xlstm, or .safetensors files in the model directory. The inference worker scans and discovers them automatically. When connected to a Controller, models are fetched from the model registry. With --data-workers, models can be automatically downloaded from Data Worker storage.
Split large models across multiple inference workers. Each handles a subset of layers.
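A two-node pipeline split could be launched as in the sketch below. The binary name, hostnames, and ports are assumptions; the `--enable-pipeline` and `--peers` flags come from the table above, and each node lists the other as its peer so layer activations can flow between them.

```shell
# Node A: serves the first portion of the layers (hosts are illustrative)
inferenced --model-dir /srv/models --enable-pipeline \
  --peers http://node-b.example:8080

# Node B: serves the remaining layers, with node A registered as its peer
inferenced --model-dir /srv/models --enable-pipeline \
  --peers http://node-a.example:8080
```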
How inferenced stacks up against Ollama and vLLM.
| Feature | inferenced | Ollama | vLLM |
|---|---|---|---|
| GGUF native | ✓ | ✓ | ✗ |
| Cold start | None (preloaded) | Model load time | Model load time |
| GPU memory overhead | Minimal | Ollama runtime | Python runtime |
| Continuous batching | ✓ | Limited | ✓ |
| Model distribution | Data Worker | Ollama Hub | HuggingFace |
| xLSTM support | ✓ | ✗ | ✗ |
| Pipeline parallelism | ✓ | ✗ | ✓ |
| External dependencies | None | Ollama process | Python + PyTorch |
| Cluster integration | Controller + Swarm | Standalone | Standalone |
| CUDA + Metal GPU | ✓ | ✓ | CUDA only |
Scale from free tier to unlimited enterprise deployments.
| Feature | Free | Standard | Professional | Enterprise |
|---|---|---|---|---|
| Inference workers | 1 | 3 | 10 | Unlimited |
| GGUF models | ✓ | ✓ | ✓ | ✓ |
| xLSTM models | ✓ | ✓ | ✓ | ✓ |
| Streaming SSE | ✓ | ✓ | ✓ | ✓ |
| Controller registration | ✓ | ✓ | ✓ | ✓ |
| Swarm integration | ✗ | ✓ | ✓ | ✓ |
| Pipeline parallelism | ✗ | ✗ | ✓ | ✓ |
| Model distribution | ✗ | ✗ | ✓ | ✓ |
| Multi-GPU | ✗ | ✗ | ✓ | ✓ |
| Training integration | ✗ | ✓ | ✓ | ✓ |
Contact license@core.at for custom licensing with unlimited inference workers, pipeline parallelism, and priority support.
Download the Eldric distributed package and start serving models natively on your own infrastructure.