Both the Eldric Client and the Eldric Controller support multiple backends. Mix local inference with cloud APIs across your infrastructure.

OpenAI-Compatible Streaming

[Diagram: Eldric Streaming Architecture]

Universal SSE Streaming

All backends support real-time token streaming via Server-Sent Events (SSE). Set stream: true in your /v1/chat/completions request; streamed tokens flow through Edge → Router → Worker → Backend.
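
A minimal streaming sketch using the standard OpenAI Python SDK (the base URL, API key, and model name below are assumptions; point them at your own Eldric endpoint):

# Minimal SSE streaming sketch against an OpenAI-compatible endpoint.
# Assumptions: base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama3",  # any model served by your backend
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,     # enables SSE token streaming
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()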

Unified Backend Features

Streaming

  • SSE (Server-Sent Events)
  • Real-time token delivery
  • OpenAI-compatible format
  • Zero-copy proxy

Unified API

  • /v1/chat/completions
  • /v1/models
  • /v1/embeddings
  • Same API for all backends (see the example after this section)

Load Balancing

  • Round-robin / Least connections
  • AI-powered routing
  • Automatic failover
  • Health monitoring

Multi-Backend

  • Mix local + cloud
  • Fallback chains
  • Per-model routing
  • Hot backend switching
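
As a brief sketch of that unified API (the base URL, API key, and model names below are assumptions; substitute your own deployment), the same client calls work no matter which backend serves the model:

# Sketch: identical OpenAI-compatible calls regardless of the backend behind them.
# Assumptions: the base URL, API key, and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

# GET /v1/models: discover whatever the configured backends expose
for model in client.models.list():
    print(model.id)

# POST /v1/embeddings: handled by whichever backend hosts the embedding model
emb = client.embeddings.create(model="nomic-embed-text", input="hello world")
print(len(emb.data[0].embedding))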

Local & Self-Hosted

Ollama

  • Port: 11434
  • REST API
  • Auto model discovery
  • Default backend

vLLM

  • Port: 8000
  • OpenAI-compatible
  • PagedAttention
  • High throughput

llama.cpp

  • Port: 8080
  • REST + WebSocket
  • GGUF models
  • CPU + GPU

HuggingFace TGI

  • Port: 8080
  • REST + gRPC
  • Tensor parallelism
  • Continuous batching

LocalAI

  • Port: 8080
  • OpenAI-compatible
  • Multiple formats
  • CPU optimized

ExLlamaV2

  • Port: 5000
  • REST API
  • GPTQ/EXL2 quants
  • Fast inference

LMDeploy

  • Port: 23333
  • OpenAI-compatible
  • TurboMind engine
  • Quantization

MLC LLM

  • Port: 8080
  • REST API
  • Universal deploy
  • WebGPU support

Enterprise & ML Platforms

NVIDIA Triton

  • Port: 8000-8002
  • REST + gRPC
  • TensorRT optimization
  • Multi-framework

NVIDIA NIM

  • Port: 8000
  • OpenAI-compatible
  • Optimized containers
  • Enterprise ready

TensorFlow Serving

  • Port: 8501/8500
  • REST + gRPC
  • Model versioning
  • Batch prediction

TorchServe

  • Port: 8080/8081
  • REST + gRPC
  • PyTorch native
  • Model archive

ONNX Runtime

  • Port: 8001
  • REST + gRPC
  • Cross-platform
  • Hardware agnostic

DeepSpeed-MII

  • Port: 28080
  • REST API
  • ZeRO-Inference
  • Low latency

BentoML

  • Port: 3000
  • REST + gRPC
  • Model packaging
  • Adaptive batching

Ray Serve

  • Port: 8000
  • REST API
  • Auto-scaling
  • Distributed

Cloud AI Services

AWS SageMaker

  • HTTPS endpoint
  • REST API
  • Auto-scaling
  • Multi-model

AWS Bedrock

  • HTTPS endpoint
  • REST API
  • Foundation models
  • Managed service

Azure ML

  • HTTPS endpoint
  • REST + SDK
  • Managed compute
  • MLflow integration

Azure OpenAI

  • HTTPS endpoint
  • OpenAI-compatible
  • Enterprise security
  • Regional deploy

Google Vertex AI

  • HTTPS endpoint
  • REST + gRPC
  • TPU support
  • Model Garden

Groq

  • HTTPS API
  • OpenAI-compatible
  • LPU inference
  • Ultra-fast

xAI (Grok)

  • HTTPS API
  • OpenAI-compatible
  • Grok models
  • Vision + Tools

Together AI

  • HTTPS API
  • OpenAI-compatible
  • Open models
  • Fine-tuning

Fireworks AI

  • HTTPS API
  • OpenAI-compatible
  • Fast inference
  • Function calling

Anyscale

  • HTTPS API
  • OpenAI-compatible
  • Ray-based
  • Scalable

Replicate

  • HTTPS API
  • REST API
  • Model hosting
  • Pay-per-use

Model Provider APIs

OpenAI

  • HTTPS API
  • REST API
  • GPT-4, GPT-4o
  • Assistants API

Anthropic

  • HTTPS API
  • REST API
  • Claude models
  • Tool use

Google Gemini

  • HTTPS API
  • REST API
  • Gemini Pro/Ultra
  • Multimodal

Mistral AI

  • HTTPS API
  • OpenAI-compatible
  • Mistral/Mixtral
  • Function calling

Cohere

  • HTTPS API
  • REST API
  • Command models
  • Embeddings + Rerank

AI21 Labs

  • HTTPS API
  • REST API
  • Jurassic models
  • Specialized tasks

Concept-Space Models

Unlike token-based LLMs, Meta LCM operates in semantic concept space—reasoning at a higher level of abstraction before generating text.

Meta LCM (Local)

  • Port: 8000
  • REST API
  • SONAR encoder (200+ langs)
  • Diffusion-based reasoning
  • Concept ↔ Text conversion

Meta LCM Cloud

  • HTTPS API
  • OpenAI-compatible
  • Managed infrastructure
  • High availability
  • Enterprise SLA

Specialized & Platform-Specific

MLX (Apple Silicon)

  • Port: 8080
  • REST API
  • Metal acceleration
  • Unified memory

KServe

  • Port: 8080
  • REST + gRPC
  • Kubernetes native
  • Serverless

Seldon Core

  • Port: 9000
  • REST + gRPC
  • ML deployment
  • A/B testing

OpenAI-Compatible

  • Any port
  • Custom endpoints
  • API key auth
  • Drop-in support

Backend by Use Case

Use Case | Recommended Backends | Why
Development | Ollama, LocalAI, LMDeploy | Easy setup, free, local
Production API | vLLM, TGI, Triton, NIM | High throughput, batching, enterprise
Edge / IoT | llama.cpp, MLC LLM, ExLlamaV2 | CPU inference, small footprint, quantized
Apple Silicon | MLX, Ollama, MLC LLM | Metal acceleration, unified memory
Low Latency | Groq, xAI, Fireworks, DeepSpeed-MII | Optimized hardware, fast inference
Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed
Open Models | Together AI, Anyscale, Replicate | Llama, Mistral, open weights
Kubernetes | KServe, Seldon, Ray Serve | Cloud-native, auto-scaling

Streaming & Feature Support

Backend | Type | Streaming | Vision | Native Tools | XML Tools | Embeddings
Ollama Local
vLLM Enterprise
TGI Enterprise
NVIDIA Triton Enterprise
llama.cpp Local
MLX Local (macOS)
OpenAI Cloud
Anthropic Cloud
Groq Cloud
xAI (Grok) Cloud
Together AI Cloud
Azure OpenAI Cloud
Meta LCM Concept-Space

✓ = Supported, — = Not available for this backend

Availability

Eldric Client (CLI + GUI): Ollama, vLLM, llama.cpp, TGI, MLX, OpenAI-compatible endpoints

Eldric Controller: All 35+ backends with unified API, load balancing, streaming, and failover

Advanced Inference Strategies

Beyond single-backend inference, Eldric supports model splitting, request splitting, ensemble methods, and intelligent routing across all backends.

Model Splitting (Pipeline Parallelism)

Split large models across multiple workers. A 70B model that doesn't fit on one GPU is sharded by layers across 2+ workers. Each worker loads only its assigned layers from GGUF via NFS.

# Deploy 70B across 3 workers
POST /api/v1/pipeline/deploy
{ "model_id": "llama-70B", "workers": ["wrk-1","wrk-2","wrk-3"] }

Uses llama.cpp RPC under the hood: head worker runs llama-server --rpc, others run llama-rpc-server.

Full documentation →

Request Splitting (Load Balancing)

Multiple workers run the same model independently. The router distributes incoming requests across workers using AI-powered load balancing, intent detection, and theme-based specialization.

Strategies: round_robin, least_connections, load_based, latency_based, random, ai_routing (LLM-powered decisions)

The router auto-detects the model theme (medicine, legal, code) and routes requests to specialized workers; an xLSTM predictor forecasts load spikes.
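
As a toy illustration of the selection rule behind least_connections (this is not the router's actual implementation, just the idea), the strategy simply picks the worker with the fewest in-flight requests:

# Illustrative only: the selection rule behind a least_connections strategy.
def pick_worker(active_connections: dict[str, int]) -> str:
    """Return the worker with the fewest in-flight requests."""
    return min(active_connections, key=active_connections.get)

print(pick_worker({"wrk-1": 7, "wrk-2": 2, "wrk-3": 5}))  # prints wrk-2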

Router documentation →

Model Swarm (LLM Ensemble)

Send the same query to multiple LLMs simultaneously and combine results. The router's Swarm LLM strategies use multi-model consensus for higher accuracy.

Ensemble strategies:

  • debate: models argue, a judge picks the best answer
  • critique: the first model generates, the second critiques
  • best_of_n: N responses, scored by a judge
  • vote: majority consensus across models

Swarm LLM documentation →
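
The router applies these strategies server-side. As a client-side sketch of what vote amounts to (the base URL, API key, model names, and the exact-match voting rule are assumptions; the real Swarm strategies use judge models), the same prompt is fanned out to several models and the majority answer wins:

# Sketch: client-side "vote" over several models behind the unified API.
# Assumptions: base URL, API key, and model names are placeholders; the actual
# router uses judge models rather than exact-match voting.
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_KEY")

def vote(prompt: str, models: list[str]) -> str:
    answers = []
    for m in models:
        resp = client.chat.completions.create(
            model=m, messages=[{"role": "user", "content": prompt}]
        )
        answers.append(resp.choices[0].message.content.strip())
    # Majority consensus: the most common answer wins
    return Counter(answers).most_common(1)[0][0]

print(vote("What is 2 + 2? Answer with a single number.", ["llama3", "mistral", "qwen2"]))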

xLSTM Methods

Sepp Hochreiter's extended LSTM architecture is integrated throughout the Eldric platform for predictive workloads.

Router: xLSTM predictor for workload forecasting, anomaly detection, fast sequence classification

Training: Native xLSTM training backend with sLSTM and mLSTM cell support

Science: xLSTM anomaly detection on time-series data (seismic, genomic, financial)

Multi-Backend Routing

Route different models to different backends seamlessly: local Ollama for small models, vLLM for batch inference, and Cloud Workers for GPT-4o/Claude, all behind a single OpenAI-compatible API.

Flow: Client → Edge (TLS) → Router (AI decision) → Worker (Ollama/vLLM/Cloud)

Fallback chain: Primary backend → local fallback → cloud fallback

Cloud Worker documentation →
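
The Controller performs failover itself; as a client-side sketch of what a fallback chain means (the URLs, keys, and model names below are assumptions), an ordered list of OpenAI-compatible endpoints is tried until one responds:

# Sketch: try each OpenAI-compatible endpoint in order until one succeeds.
# Assumptions: the URLs, keys, and model names below are placeholders.
from openai import OpenAI

FALLBACK_CHAIN = [
    ("http://localhost:11434/v1", "ollama", "llama3"),           # local Ollama
    ("http://vllm.internal:8000/v1", "none", "llama-70B"),       # vLLM worker
    ("https://api.openai.com/v1", "YOUR_OPENAI_KEY", "gpt-4o"),  # cloud fallback
]

def complete(prompt: str) -> str:
    last_error = None
    for base_url, key, model in FALLBACK_CHAIN:
        try:
            client = OpenAI(base_url=base_url, api_key=key)
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return resp.choices[0].message.content
        except Exception as err:  # move on to the next backend
            last_error = err
    raise RuntimeError(f"all backends failed: {last_error}")

print(complete("Summarize SSE streaming in one sentence."))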

Concept-Space & Latent Prediction

Meta's Large Concept Model (LCM) reasons in concept space rather than token space. VL-JEPA predicts latent video-language representations. Both run as Eldric backends.

LCM: Concept-space reasoning, plan in abstract representations

VL-JEPA: Joint embedding predictive architecture for video-language tasks

Latent reasoning: COCONUT, Quiet-STaR, Pause Tokens, Hidden CoT, DeepSeek DSA