Connect to any AI inference endpoint with a unified API
Both Eldric Client and Controller support multiple backends. Mix local inference with cloud APIs across your infrastructure.
All backends support real-time token streaming via Server-Sent Events (SSE). Use stream: true in your /v1/chat/completions request. The streaming flows seamlessly through Edge → Router → Worker → Backend.
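As a minimal sketch of consuming that stream, the snippet below sends a `stream: true` request and parses the OpenAI-compatible SSE `data:` lines. The base URL and model name are placeholders for your own deployment; only Python's standard library is used.

```python
import json
import urllib.request

def parse_sse_chunk(line: str) -> str:
    """Extract the delta text from one SSE 'data:' line of an
    OpenAI-compatible /v1/chat/completions stream."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    payload = json.loads(line[len("data: "):])
    return payload["choices"][0]["delta"].get("content", "")

def stream_completion(base_url: str, model: str, prompt: str):
    """Yield tokens from a streaming chat completion.
    base_url is your Eldric edge endpoint (assumed, e.g. http://localhost:8080)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # enables SSE token streaming
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            token = parse_sse_chunk(raw.decode().strip())
            if token:
                yield token
```

Because every backend speaks the same SSE format here, this client works unchanged whether the worker is serving Ollama, vLLM, or a cloud API.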
Unlike token-based LLMs, Meta LCM operates in semantic concept space—reasoning at a higher level of abstraction before generating text.
| Use Case | Recommended Backends | Why |
|---|---|---|
| Development | Ollama, LocalAI, LMDeploy | Easy setup, free, local |
| Production API | vLLM, TGI, Triton, NIM | High throughput, batching, enterprise |
| Edge / IoT | llama.cpp, MLC LLM, ExLlamaV2 | CPU inference, small footprint, quantized |
| Apple Silicon | MLX, Ollama, MLC LLM | Metal acceleration, unified memory |
| Low Latency | Groq, xAI, Fireworks, DeepSpeed-MII | Optimized hardware, fast inference |
| Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed |
| Open Models | Together AI, Anyscale, Replicate | Llama, Mistral, open weights |
| Kubernetes | KServe, Seldon, Ray Serve | Cloud-native, auto-scaling |
| Backend | Type | Streaming | Vision | Native Tools | XML Tools | Embeddings |
|---|---|---|---|---|---|---|
| Ollama | Local | ✓ | ✓ | ✓ | ✓ | ✓ |
| vLLM | Enterprise | ✓ | ✓ | ✓ | ✓ | ✓ |
| TGI | Enterprise | ✓ | ✓ | — | ✓ | — |
| NVIDIA Triton | Enterprise | ✓ | ✓ | — | ✓ | ✓ |
| llama.cpp | Local | ✓ | ✓ | — | ✓ | ✓ |
| MLX | Local (macOS) | ✓ | — | ✓ | — | — |
| OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ | ✓ |
| Anthropic | Cloud | ✓ | ✓ | ✓ | ✓ | — |
| Groq | Cloud | ✓ | ✓ | — | ✓ | ✓ |
| xAI (Grok) | Cloud | ✓ | ✓ | ✓ | ✓ | — |
| Together AI | Cloud | ✓ | ✓ | ✓ | ✓ | ✓ |
| Azure OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ | ✓ |
| Meta LCM | Concept-Space | ✓ | — | — | ✓ | |
✓ = Supported, — = Not available for this backend
Beyond single-backend inference, Eldric supports model splitting, request splitting, ensemble methods, and intelligent routing across all backends.
Split large models across multiple workers. A 70B model that doesn't fit on one GPU is sharded by layers across 2+ workers. Each worker loads only its assigned layers from GGUF via NFS.
Uses llama.cpp RPC under the hood: the head worker runs `llama-server --rpc`, while the other workers run `llama-rpc-server`.
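The launch topology above can be sketched as a small helper that builds the per-worker commands. The hostnames, port, and model path are assumptions for illustration; the binary names follow the description above.

```python
def rpc_launch_commands(head: str, workers: list[str],
                        model_path: str, port: int = 50052) -> dict[str, str]:
    """Build launch commands for a layer-split llama.cpp RPC deployment.

    head     -- hostname of the head worker (runs llama-server --rpc)
    workers  -- hostnames of the remaining workers (run llama-rpc-server)
    Hostnames, port, and model path are illustrative placeholders.
    """
    peers = ",".join(f"{w}:{port}" for w in workers)
    # Non-head workers expose their assigned layers over RPC.
    cmds = {w: f"llama-rpc-server --host 0.0.0.0 --port {port}" for w in workers}
    # The head worker serves the API and forwards layer work to its peers.
    cmds[head] = f"llama-server --model {model_path} --rpc {peers}"
    return cmds
```

Each worker reads only its assigned layers from the shared GGUF file on NFS, so no node needs enough memory for the whole model.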
Multiple workers run the same model independently. The router distributes incoming requests across workers using AI-powered load balancing, intent detection, and theme-based specialization.
Router auto-detects model theme (medicine, legal, code) and routes to specialized workers. xLSTM predictor forecasts load spikes.
Router documentation →

Send the same query to multiple LLMs simultaneously and combine the results. The router's Swarm LLM strategies use multi-model consensus for higher accuracy.
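A minimal consensus sketch, assuming each backend is represented as a callable from prompt to answer (a stand-in for real client calls; the router's actual Swarm strategies are richer than a plain majority vote):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer; ties resolve to the first seen."""
    return Counter(answers).most_common(1)[0][0]

def ensemble_ask(backends, prompt: str) -> str:
    """Fan the same prompt out to several backends in parallel and
    combine the answers by majority vote."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        answers = list(pool.map(lambda backend: backend(prompt), backends))
    return majority_vote(answers)
```

Fanning out in parallel means the ensemble's latency is roughly that of the slowest backend, not the sum of all of them.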
Sepp Hochreiter's extended LSTM architecture is integrated throughout the Eldric platform for predictive workloads.
Route different models to different backends seamlessly. Local Ollama for small models, vLLM for batch inference, Cloud Workers for GPT-4o/Claude, all behind a single OpenAI-compatible API.
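Conceptually, the routing layer resolves a model name to a backend endpoint, as in this minimal sketch; the model names and URLs below are hypothetical placeholders, not a real Eldric configuration:

```python
# Hypothetical model-to-backend map; names and endpoints are illustrative.
ROUTES = {
    "llama3.2:3b": "http://ollama.local:11434/v1",       # small model, local Ollama
    "mixtral-8x7b": "http://vllm.internal:8000/v1",      # batch inference on vLLM
    "gpt-4o": "https://cloud-worker.example/v1",         # cloud worker
}

def backend_for(model: str,
                default: str = "http://ollama.local:11434/v1") -> str:
    """Resolve which backend serves a model; unknown models fall back
    to the default local backend."""
    return ROUTES.get(model, default)
```

The caller never sees this mapping: every model is requested through the same OpenAI-compatible `/v1/chat/completions` endpoint, and only the `model` field changes.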
Meta's Large Concept Model (LCM) reasons in concept space rather than token space. VL-JEPA predicts latent video-language representations. Both run as Eldric backends.