Inference providers · use case

The control plane your inference business already needs

by Juergen Paulhart · 2026-04-24 · ~7 min read

“Our engineers are shipping model performance. They’re also, for the third time, re-implementing tenant quotas, usage metering, conversation history, and an audit trail. The model isn’t the moat.”
[Architecture diagram: your whitelabelled webchat at chat.yourbrand.ai (your domain, your logo, your theme) fronts the Eldric AI OS control plane, the boring open-source plumbing: Identity (users/tenants/keys, 4 account types), Router (5 LB strategies, intent classifier), Quota + billing (/metrics, usage, per-tenant limits), Memory (vector + matrix, per-user scope), Audit (hash-chained, GDPR ready), Edge (webchat shell, /v1/chat/completions, SSE streaming, API keys), 14 dashboards in modular JS with a whitelabel-ready theme. Your engineers stop writing this layer. Behind it sits your GPU fleet, the actual product: an H100 partition (premium tier), an A100 partition (standard tier), a 4090 partition (hobbyist tier), plus cloud burst to Bedrock / xAI / Groq.]

If you run an inference business — an API-for-model service, a sovereign-cloud GPU operator, a regional private-AI provider, a whitelabel agency — the interesting engineering is the inference. Everything around it (tenants, quotas, routing, auth, memory, audit, billing hooks, compliance reports, the chat UI) is undifferentiated heavy lifting. Every provider rebuilds a version of it poorly.

Eldric AI OS is that control plane: open source and deployable in a day. Whitelabel the chat UI, run your own GPU fleet upstream, and let the 15 role modules do the plumbing your engineers keep re-implementing.

Value propositions

Whitelabel-ready webchat

Modular shell at /chat. Swap the logo, swap the theme, swap the footer. 14 dashboards (admin, agents, training, knowledge, …) that look like your product, not ours.

Real multi-tenant from day one

Users, tenants, projects, workgroups, API keys, 4 account types — all in alpha.3. Customer onboarding is a POST; tenant isolation is enforced at the router.
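
A minimal sketch of what "onboarding is a POST" can look like from your provisioning scripts, assuming an admin-scoped API key; the /api/tenants path, payload fields, and response shape are illustrative assumptions, not the documented Eldric API.

```python
# Hypothetical tenant-onboarding call; endpoint path and payload fields are
# assumptions for illustration, not the documented Eldric AI OS API.
import requests

ELDRIC = "https://api.yourbrand.ai"        # your Eldric AI OS controller
ADMIN_KEY = "sk-admin-..."                 # an admin-scoped API key

def onboard_tenant(name: str, plan: str, monthly_token_quota: int) -> dict:
    """Create a tenant and an initial API key in one provisioning step."""
    resp = requests.post(
        f"{ELDRIC}/api/tenants",
        headers={"Authorization": f"Bearer {ADMIN_KEY}"},
        json={
            "name": name,
            "account_type": plan,                       # one of the 4 account types
            "quota": {"tokens_per_month": monthly_token_quota},
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()                                  # e.g. tenant_id + tenant API key

if __name__ == "__main__":
    print(onboard_tenant("acme-gmbh", "standard", 50_000_000))
```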

Model-agnostic routing

Backend abstraction already supports Ollama, vLLM, TGI, llama.cpp, MLX, OpenAI, Anthropic, xAI, Groq, Together, HuggingFace, Eldric-native. Your in-house engine is one more plugin.
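
From the customer's side the routing layer is invisible: they call the OpenAI-compatible /v1/chat/completions edge and the router decides which backend serves the request. A hedged sketch using the standard openai Python client; the base URL, API key, and model name are placeholders, and which backend they resolve to is whatever your routing config says.

```python
# Client-side view of model-agnostic routing: the same OpenAI-compatible call
# may land on your H100 partition, a 4090 node, or a cloud-burst backend.
# Base URL, API key, and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://chat.yourbrand.ai/v1",   # Eldric's OpenAI-compatible edge
    api_key="sk-tenant-...",                   # tenant-scoped key issued at onboarding
)

stream = client.chat.completions.create(
    model="llama-3.3-70b",                     # router maps this to a backend + partition
    messages=[{"role": "user", "content": "Summarise our Q3 incident reports."}],
    stream=True,                               # SSE streaming passes straight through
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```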

Hash-chained audit + GDPR

Every prompt, every retrieval, every tool call is logged to a hash-chained trail. Enterprise customers ask for exactly that, reports-ready; you ship it out of the box.
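
Hash chaining means each audit record embeds a digest of the previous one, so any edit or deletion breaks the chain from that point on. A minimal verification sketch over a JSONL log; the field names (hash, prev_hash) and the canonicalisation are assumptions, not Eldric's documented record format.

```python
# Sketch of hash-chain verification over an append-only JSONL audit log.
# Field names ("hash", "prev_hash") and the canonical encoding are assumptions.
import hashlib
import json

def record_digest(record: dict) -> str:
    """Digest of the record body, excluding its own hash field."""
    body = {k: v for k, v in record.items() if k != "hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify_chain(path: str) -> bool:
    prev_hash = None
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            rec = json.loads(line)
            if rec.get("prev_hash") != prev_hash:
                print(f"chain broken before record {line_no}")
                return False
            if rec.get("hash") != record_digest(rec):
                print(f"record {line_no} was modified")
                return False
            prev_hash = rec["hash"]
    return True

if __name__ == "__main__":
    print("audit log intact:", verify_chain("audit-2026-04.jsonl"))
```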

Matrix Memory as a product surface

Your customers get persistent, cross-session memory without you building a vector-DB business. Compressed associative recall via xLSTM-inspired Matrix Memory is a premium feature.

Boring supply chain

Signed RPM on the EU-hosted repo.eldric.ai. Upgrade via dnf upgrade. No vendor dance, no procurement held hostage.

AI-driven differentiator

The serious inference providers (Together, Fireworks, Groq, Mistral, Ollama Cloud) differentiate on hardware and scheduler — the control plane is table stakes they each rebuild. Eldric turns table stakes into a commodity. The commercial position: sell hardware and SLA, not the tenant-and-audit layer. Your control plane becomes open source, which is a feature for sovereign-cloud customers who don’t want another black box.

Scalable use cases

Runs on commodity hardware

Eldric AI OS was built to land on small clusters, not on hyperscaler fleets. The whole stack is one binary; the on-prem LLM is embedded llama.cpp. The hardware plan that gets most organisations into production looks like this:

3× RTX 4090 — sweet spot

72 GB total VRAM with tensor-split: Llama 3.3 70B Q4 at 60–80 tok/s, with a parallel 8B routing model and an embedding server running concurrently (see the VRAM sketch after these tiers). One-time hardware cost ~€5–7k.

Single RTX 4090 / 4080 — team scale

24 GB. Llama 3.1 8B at 80+ tok/s, 13B comfortable, 32B Q4 possible. Enough for a small department chat with fan-out retrieval.

CPU-only — pilot scale

llama.cpp on 32+ core x86 runs 8B Q4 usefully. Matrix Memory is CPU-memory-bound. A refurbished server from the rack is enough to prove the architecture.

Scale up

Multi-node cluster with H100 / GH200 for research-grade workloads. Same binary, same role modules, topology-aware. See the HPC article.

Starter rack

A single 8× H100 node or a 3× 4090 node is enough to pilot 50–100 paying tenants. Scale horizontally by adding inference-role nodes to the same controller.
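
A rough back-of-envelope for why the 3× 4090 tier holds the 70B setup: the weight sizes come from the disk-bill table below, while the KV-cache and runtime-overhead figures are assumptions, so treat this as a sanity check rather than a capacity planner.

```python
# Back-of-envelope VRAM check for the 3x RTX 4090 "sweet spot" tier.
# Figures are rounded; KV-cache and overhead estimates are assumptions.
GPUS = 3
VRAM_PER_GPU_GB = 24
total_vram = GPUS * VRAM_PER_GPU_GB            # 72 GB with tensor-split

llama70b_q4_weights = 40.0                     # GB, Q4_K_M GGUF (see disk-bill table)
kv_cache_16k_ctx   = 5.0                       # GB, rough estimate for a 16k context
routing_model_8b   = 4.9                       # GB, 8B Q4_K_M for intent routing
embedding_model    = 0.7                       # GB, nomic-embed-text
runtime_overhead   = 3.0                       # GB, CUDA context, buffers, scratch

used = (llama70b_q4_weights + kv_cache_16k_ctx +
        routing_model_8b + embedding_model + runtime_overhead)

print(f"budget: {total_vram} GB, planned: {used:.1f} GB, "
      f"headroom: {total_vram - used:.1f} GB")
# budget: 72 GB, planned: 53.6 GB, headroom: 18.4 GB
```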

The arithmetic: a €6k workstation displaces a €30–60k-per-year SaaS-AI contract that still leaks IP, still can’t reach your mainframe, and still has a “we may use your data for training” clause hiding somewhere.

What the disk bill looks like

Artefact · Size · Notes
eldric-aios-5.0.0-3.alpha3.fc43.x86_64.rpm · ~1.4 MB · CPU baseline binary; one RPM, one systemd unit.
eldric-aios-cuda add-on · ~512 MB · Pulled in automatically via Supplements: cuda-drivers on GPU hosts. Contains GGML_CUDA llama.cpp.
Llama 3.1 8B Q4_K_M GGUF · ~4.9 GB · Good default for team-scale chat on a single 4090.
Llama 3.3 70B Q4_K_M GGUF · ~40 GB · The sweet spot for 3×4090 tensor-split. Holds a 16k context comfortably.
Mixtral 8x22B Q4 GGUF · ~80 GB · Tight on 3×4090; comfortable on 4×4090 or 2×H100.
nomic-embed-text (embedding) · ~700 MB · CPU or GPU. One per cluster; handles vector indexing.
Matrix Memory .emm per domain · 50–500 MB · Depends on rank × dim (see memory article). chat 64/768 ~200 kB; particle_physics 512/1024 ~500 MB.
Vector store per 1M chunks · ~6–10 GB · Depends on embedding dim. SQLite backend; FAISS optional.
Hash-chained audit log · ~200 MB / 1M calls · JSONL, append-only, rotation at 500 MB files by default.
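
To make the table concrete, a small sketch that sums the artefacts for a team-scale box (single 4090, the 8B model, one memory domain, a 1M-chunk vector store), taking the upper end of each range.

```python
# Rough disk footprint for a team-scale deployment, summing artefact sizes
# from the table above (upper end of each range, in GB).
footprint_gb = {
    "eldric-aios RPM":               0.0014,
    "eldric-aios-cuda add-on":       0.512,
    "Llama 3.1 8B Q4_K_M GGUF":      4.9,
    "nomic-embed-text":              0.7,
    "Matrix Memory .emm (1 domain)": 0.5,
    "Vector store (1M chunks)":      10.0,
    "Audit log (1M calls)":          0.2,
}

total = sum(footprint_gb.values())
for name, gb in footprint_gb.items():
    print(f"{name:32s} {gb:7.2f} GB")
print(f"{'total':32s} {total:7.2f} GB")   # ~16.8 GB: fits easily on the pilot-tier NVMe
```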

Three reference hardware setups

Spec · Pilot / team · Department / BU · Production / enterprise
CPU · 1× EPYC 7313 (16c) or i9-14900K · 2× EPYC 9354 (32c each) · 2× EPYC 9654 (96c) per node
GPU · 1× RTX 4090 (24 GB) · 3× RTX 4090 (72 GB) · 4× H100 (320 GB) or 8× H200
RAM · 128 GB DDR5 · 256 GB DDR5 ECC · 1 TB DDR5 ECC per node
Storage · 2× 4 TB NVMe (RAID-1) · 6× 8 TB NVMe (RAID-10) + SSD cache · Tiered: NVMe hot + TB-scale HDD / Lustre
Network · 1 GbE OK · 10 GbE with link agg · 25/100 GbE or IB-HDR for multi-node
Power · ~1 kW typical / 1.5 kW peak · ~2 kW typical / 3 kW peak · 4–6 kW per node
Hardware cost · ~€4–5k · ~€12–15k · €80–250k per node
Serves · 8B model, 10–30 concurrent chat users · 70B Q4 at 60–80 tok/s, 200–500 users · Mixtral / Llama-405B, 2k+ users per node


SWOT — an honest read

Strengths

  • Whitelabel chat UI + 14 dashboards shipped modular
  • Tenant, quota, routing, auth, memory, audit — all six pieces real in alpha.3
  • Model-agnostic backend layer: Ollama, vLLM, OpenAI-compat, Anthropic, xAI, Groq, Together
  • Open source — no vendor lock-in angle for sovereign-cloud sell

Weaknesses

  • Billing hook is generic (metrics + audit feeds) — Stripe / Zuora integrations are customer-built
  • No vendor-grade SLA yet; 24×7 commercial support is a contract item
  • alpha.3 — maturing fast but not yet at the stability of established SaaS platforms
  • Marketplace of third-party extensions still thin

Opportunities

  • Sovereign-cloud push in EU / Middle East / Southeast Asia
  • Hyperscaler fatigue among enterprise buyers
  • GDPR + AI Act enforcement making “EU data residency” a paid feature
  • Growth of smaller GPU clouds that need a control plane they don’t want to build

Threats

  • Hyperscaler undercutting on commodity inference
  • Proprietary control planes (Bedrock, Azure AI Studio) with deep AWS/Azure integration
  • Other OSS orchestration stacks (LiteLLM, OpenDevin) for lighter use cases
  • Providers building their own bespoke control plane instead of buying one

First entry points — concrete value in 30 / 90 / 180 days

30 days

Stand up the reference stack

Deploy alpha.3 on a single box. Wire your GPU fleet as backends. Whitelabel the webchat with your brand in one CSS patch. Test with 3 internal tenants.

90 days

First paying tenant

Onboard a friendly customer with API keys + quota. Usage metering into your billing system via /metrics. SLA draft, audit report scheduled monthly.
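
A hedged sketch of turning the /metrics feed into billing line items; the metric name (eldric_tokens_total) and the tenant label are assumptions for illustration, not documented metric names.

```python
# Sketch: turn per-tenant token counters scraped from Eldric's /metrics endpoint
# into billing line items. The metric name (eldric_tokens_total) and the "tenant"
# label are assumptions, not documented metric names.
import re
import requests

METRICS_URL = "https://api.yourbrand.ai/metrics"
PRICE_PER_MTOK_EUR = 0.90                       # your price per million tokens

# matches e.g.: eldric_tokens_total{tenant="acme-gmbh",model="llama-3.3-70b"} 1.2e+07
LINE = re.compile(r'^eldric_tokens_total\{([^}]*)\}\s+([0-9.eE+-]+)$')
TENANT = re.compile(r'tenant="([^"]*)"')

def tenant_usage() -> dict:
    """Sum token counters per tenant from a single /metrics scrape."""
    usage = {}
    for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        labels, value = m.groups()
        t = TENANT.search(labels)
        tenant = t.group(1) if t else "unknown"
        usage[tenant] = usage.get(tenant, 0.0) + float(value)
    return usage

if __name__ == "__main__":
    for tenant, tokens in sorted(tenant_usage().items()):
        print(f"{tenant}: {tokens / 1e6:.1f}M tokens -> "
              f"EUR {tokens / 1e6 * PRICE_PER_MTOK_EUR:.2f}")
```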

180 days

Commercial launch

Tiered pricing live. 5–10 tenants. Support rotation documented. Extensions marketplace for vertical connectors. Ready for the next fundraise conversation.

Install alpha.3 · Privacy-first HPC use case · Data access article · office@eldric.ai
#InferenceProviders #SovereignAI #Whitelabel #MultiTenant #OpenSourceInfra #GPUCloud