Both the Eldric Client and the Eldric Controller support multiple backends. Mix local inference with cloud APIs across your infrastructure.

OpenAI-Compatible Streaming

[Diagram: Eldric Streaming Architecture]

Universal SSE Streaming

All backends support real-time token streaming via Server-Sent Events (SSE). Set stream: true in your /v1/chat/completions request; streamed tokens flow through Edge → Router → Worker → Backend.
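
A minimal streaming sketch using the standard OpenAI Python SDK (the base URL, API key, and model name below are assumptions; point them at your own Eldric endpoint):

# Minimal SSE streaming sketch against an OpenAI-compatible endpoint.
# Assumptions: base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama3",  # any model served by your backend
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,     # enables SSE token streaming
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()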

Unified Backend Features

Streaming

  • SSE (Server-Sent Events)
  • Real-time token delivery
  • OpenAI-compatible format
  • Zero-copy proxy

Unified API

  • /v1/chat/completions
  • /v1/models
  • /v1/embeddings
  • Same API for all backends (see the example after this section)

Load Balancing

  • Round-robin / Least connections
  • AI-powered routing
  • Automatic failover
  • Health monitoring

Multi-Backend

  • Mix local + cloud
  • Fallback chains
  • Per-model routing
  • Hot backend switching
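
As a brief sketch of that unified API (the base URL, API key, and model names below are assumptions; substitute your own deployment), the same client calls work no matter which backend serves the model:

# Sketch: identical OpenAI-compatible calls regardless of the backend behind them.
# Assumptions: the base URL, API key, and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

# GET /v1/models: discover whatever the configured backends expose
for model in client.models.list():
    print(model.id)

# POST /v1/embeddings: handled by whichever backend hosts the embedding model
emb = client.embeddings.create(model="nomic-embed-text", input="hello world")
print(len(emb.data[0].embedding))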

Local & Self-Hosted

Ollama

  • Port: 11434
  • REST API
  • Auto model discovery
  • Default backend

vLLM

  • Port: 8000
  • OpenAI-compatible
  • PagedAttention
  • High throughput

llama.cpp

  • Port: 8080
  • REST + WebSocket
  • GGUF models
  • CPU + GPU

HuggingFace TGI

  • Port: 8080
  • REST + gRPC
  • Tensor parallelism
  • Continuous batching

LocalAI

  • Port: 8080
  • OpenAI-compatible
  • Multiple formats
  • CPU optimized

ExLlamaV2

  • Port: 5000
  • REST API
  • GPTQ/EXL2 quants
  • Fast inference

LMDeploy

  • Port: 23333
  • OpenAI-compatible
  • TurboMind engine
  • Quantization

MLC LLM

  • Port: 8080
  • REST API
  • Universal deploy
  • WebGPU support

Enterprise & ML Platforms

NVIDIA Triton

  • Port: 8000-8002
  • REST + gRPC
  • TensorRT optimization
  • Multi-framework

NVIDIA NIM

  • Port: 8000
  • OpenAI-compatible
  • Optimized containers
  • Enterprise ready

TensorFlow Serving

  • Port: 8501/8500
  • REST + gRPC
  • Model versioning
  • Batch prediction

TorchServe

  • Port: 8080/8081
  • REST + gRPC
  • PyTorch native
  • Model archive

ONNX Runtime

  • Port: 8001
  • REST + gRPC
  • Cross-platform
  • Hardware agnostic

DeepSpeed-MII

  • Port: 28080
  • REST API
  • ZeRO-Inference
  • Low latency

BentoML

  • Port: 3000
  • REST + gRPC
  • Model packaging
  • Adaptive batching

Ray Serve

  • Port: 8000
  • REST API
  • Auto-scaling
  • Distributed

Cloud AI Services

AWS SageMaker

  • HTTPS endpoint
  • REST API
  • Auto-scaling
  • Multi-model

AWS Bedrock

  • HTTPS endpoint
  • REST API
  • Foundation models
  • Managed service

Azure ML

  • HTTPS endpoint
  • REST + SDK
  • Managed compute
  • MLflow integration

Azure OpenAI

  • HTTPS endpoint
  • OpenAI-compatible
  • Enterprise security
  • Regional deploy

Google Vertex AI

  • HTTPS endpoint
  • REST + gRPC
  • TPU support
  • Model Garden

Groq

  • HTTPS API
  • OpenAI-compatible
  • LPU inference
  • Ultra-fast

xAI (Grok)

  • HTTPS API
  • OpenAI-compatible
  • Grok models
  • Vision + Tools

Together AI

  • HTTPS API
  • OpenAI-compatible
  • Open models
  • Fine-tuning

Fireworks AI

  • HTTPS API
  • OpenAI-compatible
  • Fast inference
  • Function calling

Anyscale

  • HTTPS API
  • OpenAI-compatible
  • Ray-based
  • Scalable

Replicate

  • HTTPS API
  • REST API
  • Model hosting
  • Pay-per-use

Model Provider APIs

OpenAI

  • HTTPS API
  • REST API
  • GPT-4, GPT-4o
  • Assistants API

Anthropic

  • HTTPS API
  • REST API
  • Claude models
  • Tool use

Google Gemini

  • HTTPS API
  • REST API
  • Gemini Pro/Ultra
  • Multimodal

Mistral AI

  • HTTPS API
  • OpenAI-compatible
  • Mistral/Mixtral
  • Function calling

Cohere

  • HTTPS API
  • REST API
  • Command models
  • Embeddings + Rerank

AI21 Labs

  • HTTPS API
  • REST API
  • Jurassic models
  • Specialized tasks

Concept-Space Models

Unlike token-based LLMs, Meta LCM operates in semantic concept space—reasoning at a higher level of abstraction before generating text.

Meta LCM (Local)

  • Port: 8000
  • REST API
  • SONAR encoder (200+ langs)
  • Diffusion-based reasoning
  • Concept ↔ Text conversion

Meta LCM Cloud

  • HTTPS API
  • OpenAI-compatible
  • Managed infrastructure
  • High availability
  • Enterprise SLA

Specialized & Platform-Specific

MLX (Apple Silicon)

  • Port: 8080
  • REST API
  • Metal acceleration
  • Unified memory

KServe

  • Port: 8080
  • REST + gRPC
  • Kubernetes native
  • Serverless

Seldon Core

  • Port: 9000
  • REST + gRPC
  • ML deployment
  • A/B testing

OpenAI-Compatible

  • Any port
  • Custom endpoints
  • API key auth
  • Drop-in support

Backend by Use Case

Use Case | Recommended Backends | Why
Development | Ollama, LocalAI, LMDeploy | Easy setup, free, local
Production API | vLLM, TGI, Triton, NIM | High throughput, batching, enterprise
Edge / IoT | llama.cpp, MLC LLM, ExLlamaV2 | CPU inference, small footprint, quantized
Apple Silicon | MLX, Ollama, MLC LLM | Metal acceleration, unified memory
Low Latency | Groq, xAI, Fireworks, DeepSpeed-MII | Optimized hardware, fast inference
Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed
Open Models | Together AI, Anyscale, Replicate | Llama, Mistral, open weights
Kubernetes | KServe, Seldon, Ray Serve | Cloud-native, auto-scaling

Streaming & Feature Support

Backend | Type | Streaming | Vision | Native Tools | XML Tools | Embeddings
Ollama Local
vLLM Enterprise
TGI Enterprise
NVIDIA Triton Enterprise
llama.cpp Local
MLX Local (macOS)
OpenAI Cloud
Anthropic Cloud
Groq Cloud
xAI (Grok) Cloud
Together AI Cloud
Azure OpenAI Cloud
Meta LCM Concept-Space

✓ = Supported, — = Not available for this backend

Availability

Eldric Client (CLI + GUI): Ollama, vLLM, llama.cpp, TGI, MLX, OpenAI-compatible endpoints

Eldric Controller: All 35+ backends with unified API, load balancing, streaming, and failover

Advanced Inference Strategies

Beyond single-backend inference, Eldric supports model splitting, request splitting, ensemble methods, and intelligent routing across all backends.

Model Splitting (Pipeline Parallelism)

Split large models across multiple workers. A 70B model that doesn't fit on one GPU is sharded by layers across 2+ workers. Each worker loads only its assigned layers from GGUF via NFS.

# Deploy 70B across 3 workers
POST /api/v1/pipeline/deploy
{ "model_id": "llama-70B", "workers": ["wrk-1","wrk-2","wrk-3"] }

Uses llama.cpp RPC under the hood: head worker runs llama-server --rpc, others run llama-rpc-server.

Full documentation →

Request Splitting (Load Balancing)

Multiple workers run the same model independently. The router distributes incoming requests across workers using AI-powered load balancing, intent detection, and theme-based specialization.

Strategies: round_robin, least_connections, load_based, latency_based, random, ai_routing (LLM-powered decisions)

The router auto-detects the model theme (medicine, legal, code) and routes requests to specialized workers; an xLSTM predictor forecasts load spikes.
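
As a toy illustration of the selection rule behind least_connections (this is not the router's actual implementation, just the idea), the strategy simply picks the worker with the fewest in-flight requests:

# Illustrative only: the selection rule behind a least_connections strategy.
def pick_worker(active_connections: dict[str, int]) -> str:
    """Return the worker with the fewest in-flight requests."""
    return min(active_connections, key=active_connections.get)

print(pick_worker({"wrk-1": 7, "wrk-2": 2, "wrk-3": 5}))  # prints wrk-2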

Router documentation →

Model Swarm (LLM Ensemble)

Send the same query to multiple LLMs simultaneously and combine results. The router's Swarm LLM strategies use multi-model consensus for higher accuracy.

Ensemble strategies:

  • debate: models argue, a judge picks the best answer
  • critique: the first model generates, the second critiques
  • best_of_n: N responses, scored by a judge
  • vote: majority consensus across models

Swarm LLM documentation →
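
The router applies these strategies server-side. As a client-side sketch of what vote amounts to (the base URL, API key, model names, and the exact-match voting rule are assumptions; the real Swarm strategies use judge models), the same prompt is fanned out to several models and the majority answer wins:

# Sketch: client-side "vote" over several models behind the unified API.
# Assumptions: base URL, API key, and model names are placeholders; the actual
# router uses judge models rather than exact-match voting.
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

def vote(prompt: str, models: list[str]) -> str:
    answers = []
    for m in models:
        resp = client.chat.completions.create(
            model=m, messages=[{"role": "user", "content": prompt}]
        )
        answers.append(resp.choices[0].message.content.strip())
    # Majority consensus: the most common answer wins
    return Counter(answers).most_common(1)[0][0]

print(vote("What is 2 + 2? Answer with a single number.", ["llama3", "mistral", "qwen2"]))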

xLSTM Methods

Sepp Hochreiter's extended LSTM architecture is integrated throughout the Eldric platform for predictive workloads.

Router: xLSTM predictor for workload forecasting, anomaly detection, fast sequence classification

Training: Native xLSTM training backend with sLSTM and mLSTM cell support

Science: xLSTM anomaly detection on time-series data (seismic, genomic, financial)

Multi-Backend Routing

Route different models to different backends seamlessly: local Ollama for small models, vLLM for batch inference, and Cloud Workers for GPT-4o/Claude, all behind a single OpenAI-compatible API.

Flow: Client → Edge (TLS) → Router (AI decision) → Worker (Ollama/vLLM/Cloud)

Fallback chain: Primary backend → local fallback → cloud fallback

Cloud Worker documentation →
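
The Controller performs failover itself; as a client-side sketch of what a fallback chain means (the URLs, keys, and model names below are assumptions), an ordered list of OpenAI-compatible endpoints is tried until one responds:

# Sketch: try each OpenAI-compatible endpoint in order until one succeeds.
# Assumptions: the URLs, keys, and model names below are placeholders.
from openai import OpenAI

FALLBACK_CHAIN = [
    ("http://localhost:11434/v1", "ollama", "llama3"),           # local Ollama
    ("http://vllm.internal:8000/v1", "none", "llama-70B"),       # vLLM worker
    ("https://api.openai.com/v1", "YOUR_OPENAI_KEY", "gpt-4o"),  # cloud fallback
]

def complete(prompt: str) -> str:
    last_error = None
    for base_url, key, model in FALLBACK_CHAIN:
        try:
            client = OpenAI(base_url=base_url, api_key=key)
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return resp.choices[0].message.content
        except Exception as err:  # move on to the next backend
            last_error = err
    raise RuntimeError(f"all backends failed: {last_error}")

print(complete("Summarize SSE streaming in one sentence."))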

Concept-Space & Latent Prediction

Meta's Large Concept Model (LCM) reasons in concept space rather than token space. VL-JEPA predicts latent video-language representations. Both run as Eldric backends.

LCM: Concept-space reasoning, plan in abstract representations

VL-JEPA: Joint embedding predictive architecture for video-language tasks

Latent reasoning: COCONUT, Quiet-STaR, Pause Tokens, Hidden CoT, DeepSeek DSA