Inference providers · use case

The control plane your inference business already needs

by Juergen Paulhart · 2026-04-24 · ~7 min read

“Our engineers are shipping model performance. They’re also, for the third time, re-implementing tenant quotas, usage metering, conversation history, and an audit trail. The model isn’t the moat.”
[Architecture diagram: your whitelabelled webchat at chat.yourbrand.ai (your domain, your logo, your theme) fronts the Eldric AI OS control plane, the boring open-source plumbing: Identity (users/tenants/keys, 4 account types), Router (5 LB strategies, intent classifier), Quota + billing (/metrics, usage, per-tenant limits), Memory (vector + matrix, per-user scope), Audit (hash-chained, GDPR ready), Edge (webchat shell, /v1/chat/completions, SSE streaming, API keys), 14 dashboards in modular JS with a whitelabel-ready theme. Your engineers stop writing this layer. Behind it sits your GPU fleet, the actual product: an H100 partition (premium tier), an A100 partition (standard tier), a 4090 partition (hobbyist tier), plus cloud burst to Bedrock / xAI / Groq.]

If you run an inference business — an API-for-model service, a sovereign-cloud GPU operator, a regional private-AI provider, a whitelabel agency — the interesting engineering is the inference. Everything around it (tenants, quotas, routing, auth, memory, audit, billing hooks, compliance reports, the chat UI) is undifferentiated heavy lifting. Every provider rebuilds a version of it poorly.

Eldric AI OS is that control plane: open source and deployable in a day. Whitelabel the chat UI, run your own GPU fleet upstream, and let the 15 role modules do the plumbing your engineers keep re-implementing.

Value propositions

Whitelabel-ready webchat

Modular shell at /chat. Swap the logo, swap the theme, swap the footer. 14 dashboards (admin, agents, training, knowledge, …) that look like your product, not ours.

Real multi-tenant from day one

Users, tenants, projects, workgroups, API keys, 4 account types — all in alpha.3. Customer onboarding is a POST; tenant isolation is enforced at the router.
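
A minimal sketch of what "onboarding is a POST" can look like from your provisioning scripts, assuming an admin-scoped API key; the /api/tenants path, payload fields, and response shape are illustrative assumptions, not the documented Eldric API.

```python
# Hypothetical tenant-onboarding call; endpoint path and payload fields are
# assumptions for illustration, not the documented Eldric AI OS API.
import requests

ELDRIC = "https://api.yourbrand.ai"        # your Eldric AI OS controller
ADMIN_KEY = "sk-admin-..."                 # an admin-scoped API key

def onboard_tenant(name: str, plan: str, monthly_token_quota: int) -> dict:
    """Create a tenant and an initial API key in one provisioning step."""
    resp = requests.post(
        f"{ELDRIC}/api/tenants",
        headers={"Authorization": f"Bearer {ADMIN_KEY}"},
        json={
            "name": name,
            "account_type": plan,                       # one of the 4 account types
            "quota": {"tokens_per_month": monthly_token_quota},
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()                                  # e.g. tenant_id + tenant API key

if __name__ == "__main__":
    print(onboard_tenant("acme-gmbh", "standard", 50_000_000))
```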

Model-agnostic routing

Backend abstraction already supports Ollama, vLLM, TGI, llama.cpp, MLX, OpenAI, Anthropic, xAI, Groq, Together, HuggingFace, Eldric-native. Your in-house engine is one more plugin.
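
From the customer's side the routing layer is invisible: they call the OpenAI-compatible /v1/chat/completions edge and the router decides which backend serves the request. A hedged sketch using the standard openai Python client; the base URL, API key, and model name are placeholders, and which backend they resolve to is whatever your routing config says.

```python
# Client-side view of model-agnostic routing: the same OpenAI-compatible call
# may land on your H100 partition, a 4090 node, or a cloud-burst backend.
# Base URL, API key, and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://chat.yourbrand.ai/v1",   # Eldric's OpenAI-compatible edge
    api_key="sk-tenant-...",                   # tenant-scoped key issued at onboarding
)

stream = client.chat.completions.create(
    model="llama-3.3-70b",                     # router maps this to a backend + partition
    messages=[{"role": "user", "content": "Summarise our Q3 incident reports."}],
    stream=True,                               # SSE streaming passes straight through
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```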

Hash-chained audit + GDPR

Every prompt, every retrieval, every tool call is logged to a hash-chained trail. Enterprise customers ask for exactly that, reports-ready; you ship it out of the box.
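
Hash chaining means each audit record embeds a digest of the previous one, so any edit or deletion breaks the chain from that point on. A minimal verification sketch over a JSONL log; the field names (hash, prev_hash) and the canonicalisation are assumptions, not Eldric's documented record format.

```python
# Sketch of hash-chain verification over an append-only JSONL audit log.
# Field names ("hash", "prev_hash") and the canonical encoding are assumptions.
import hashlib
import json

def record_digest(record: dict) -> str:
    """Digest of the record body, excluding its own hash field."""
    body = {k: v for k, v in record.items() if k != "hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify_chain(path: str) -> bool:
    prev_hash = None
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            rec = json.loads(line)
            if rec.get("prev_hash") != prev_hash:
                print(f"chain broken before record {line_no}")
                return False
            if rec.get("hash") != record_digest(rec):
                print(f"record {line_no} was modified")
                return False
            prev_hash = rec["hash"]
    return True

if __name__ == "__main__":
    print("audit log intact:", verify_chain("audit-2026-04.jsonl"))
```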

Matrix Memory as a product surface

Your customers get persistent, cross-session memory without you building a vector-DB business. Compressed associative recall via xLSTM-inspired Matrix Memory is a premium feature.

Boring supply chain

Signed RPM on the EU-hosted repo.eldric.ai. Upgrade via dnf upgrade. No vendor dance, no procurement held hostage.

AI-driven differentiator

The serious inference providers (Together, Fireworks, Groq, Mistral, Ollama Cloud) differentiate on hardware and scheduler — the control plane is table stakes they each rebuild. Eldric turns table stakes into a commodity. The commercial position: sell hardware and SLA, not the tenant-and-audit layer. Your control plane becomes open source, which is a feature for sovereign-cloud customers who don’t want another black box.

Scalable use cases

Runs on commodity hardware

Eldric AI OS was built to land on small clusters, not on hyperscaler fleets. The whole stack is one binary; the on-prem LLM is embedded llama.cpp. The hardware plan that gets most organisations into production looks like this:

3× RTX 4090 — sweet spot

72 GB total VRAM with tensor-split: Llama 3.3 70B Q4 at 60–80 tok/s, with a parallel 8B routing model and an embedding server running concurrently (see the VRAM sketch after these tiers). One-time hardware cost ~€5–7k.

Single RTX 4090 / 4080 — team scale

24 GB. Llama 3.1 8B at 80+ tok/s, 13B comfortable, 32B Q4 possible. Enough for a small department chat with fan-out retrieval.

CPU-only — pilot scale

llama.cpp on 32+ core x86 runs 8B Q4 usefully. Matrix Memory is CPU-memory-bound. A refurbished server from the rack is enough to prove the architecture.

Scale up

Multi-node cluster with H100 / GH200 for research-grade workloads. Same binary, same role modules, topology-aware. See the HPC article.

Starter rack

A single 8× H100 node or a 3× 4090 node is enough to pilot 50–100 paying tenants. Scale horizontally by adding inference-role nodes to the same controller.
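
A rough back-of-envelope for why the 3× 4090 tier holds the 70B setup: the weight sizes come from the disk-bill table below, while the KV-cache and runtime-overhead figures are assumptions, so treat this as a sanity check rather than a capacity planner.

```python
# Back-of-envelope VRAM check for the 3x RTX 4090 "sweet spot" tier.
# Figures are rounded; KV-cache and overhead estimates are assumptions.
GPUS = 3
VRAM_PER_GPU_GB = 24
total_vram = GPUS * VRAM_PER_GPU_GB            # 72 GB with tensor-split

llama70b_q4_weights = 40.0                     # GB, Q4_K_M GGUF (see disk-bill table)
kv_cache_16k_ctx   = 5.0                       # GB, rough estimate for a 16k context
routing_model_8b   = 4.9                       # GB, 8B Q4_K_M for intent routing
embedding_model    = 0.7                       # GB, nomic-embed-text
runtime_overhead   = 3.0                       # GB, CUDA context, buffers, scratch

used = (llama70b_q4_weights + kv_cache_16k_ctx +
        routing_model_8b + embedding_model + runtime_overhead)

print(f"budget: {total_vram} GB, planned: {used:.1f} GB, "
      f"headroom: {total_vram - used:.1f} GB")
# budget: 72 GB, planned: 53.6 GB, headroom: 18.4 GB
```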

The arithmetic: a €6k workstation displaces a €30–60k-per-year SaaS-AI contract that still leaks IP, still can’t reach your mainframe, and still has a “we may use your data for training” clause hiding somewhere.

What the disk bill looks like

Artefact · Size · Notes
eldric-aios-5.0.0-3.alpha3.fc43.x86_64.rpm · ~1.4 MB · CPU baseline binary; one RPM, one systemd unit.
eldric-aios-cuda add-on · ~512 MB · Pulled in automatically via Supplements: cuda-drivers on GPU hosts. Contains GGML_CUDA llama.cpp.
Llama 3.1 8B Q4_K_M GGUF · ~4.9 GB · Good default for team-scale chat on a single 4090.
Llama 3.3 70B Q4_K_M GGUF · ~40 GB · The sweet spot for 3×4090 tensor-split. Holds a 16k context comfortably.
Mixtral 8x22B Q4 GGUF · ~80 GB · Tight on 3×4090; comfortable on 4×4090 or 2×H100.
nomic-embed-text (embedding) · ~700 MB · CPU or GPU. One per cluster; handles vector indexing.
Matrix Memory .emm per domain · 50–500 MB · Depends on rank × dim (see memory article). chat 64/768 ~200 kB; particle_physics 512/1024 ~500 MB.
Vector store per 1M chunks · ~6–10 GB · Depends on embedding dim. SQLite backend; FAISS optional.
Hash-chained audit log · ~200 MB / 1M calls · JSONL, append-only, rotation at 500 MB files by default.
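
To make the table concrete, a small sketch that sums the artefacts for a team-scale box (single 4090, the 8B model, one memory domain, a 1M-chunk vector store), taking the upper end of each range.

```python
# Rough disk footprint for a team-scale deployment, summing artefact sizes
# from the table above (upper end of each range, in GB).
footprint_gb = {
    "eldric-aios RPM":               0.0014,
    "eldric-aios-cuda add-on":       0.512,
    "Llama 3.1 8B Q4_K_M GGUF":      4.9,
    "nomic-embed-text":              0.7,
    "Matrix Memory .emm (1 domain)": 0.5,
    "Vector store (1M chunks)":      10.0,
    "Audit log (1M calls)":          0.2,
}

total = sum(footprint_gb.values())
for name, gb in footprint_gb.items():
    print(f"{name:32s} {gb:7.2f} GB")
print(f"{'total':32s} {total:7.2f} GB")   # ~16.8 GB: fits easily on the pilot-tier NVMe
```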

Three reference hardware setups

Spec · Pilot / team · Department / BU · Production / enterprise
CPU · 1× EPYC 7313 (16c) or i9-14900K · 2× EPYC 9354 (32c each) · 2× EPYC 9654 (96c) per node
GPU · 1× RTX 4090 (24 GB) · 3× RTX 4090 (72 GB) · 4× H100 (320 GB) or 8× H200
RAM · 128 GB DDR5 · 256 GB DDR5 ECC · 1 TB DDR5 ECC per node
Storage · 2× 4 TB NVMe (RAID-1) · 6× 8 TB NVMe (RAID-10) + SSD cache · Tiered: NVMe hot + TB-scale HDD / Lustre
Network · 1 GbE OK · 10 GbE with link agg · 25/100 GbE or IB-HDR for multi-node
Power · ~1 kW typical / 1.5 kW peak · ~2 kW typical / 3 kW peak · 4–6 kW per node
Hardware cost · ~€4–5k · ~€12–15k · €80–250k per node
Serves · 8B model, 10–30 concurrent chat users · 70B Q4 at 60–80 tok/s, 200–500 users · Mixtral / Llama-405B, 2k+ users per node


SWOT — an honest read

Strengths

  • Whitelabel chat UI + 14 dashboards shipped modular
  • Tenant, quota, routing, auth, memory, audit — all six pieces real in alpha.3
  • Model-agnostic backend layer: Ollama, vLLM, OpenAI-compat, Anthropic, xAI, Groq, Together
  • Open source — no vendor lock-in angle for sovereign-cloud sell

Weaknesses

  • Billing hook is generic (metrics + audit feeds) — Stripe / Zuora integrations are customer-built
  • No vendor-grade SLA yet; 24×7 commercial support is a contract item
  • alpha.3 — maturing fast but not yet at the stability of established SaaS platforms
  • Marketplace of third-party extensions still thin

Opportunities

  • Sovereign-cloud push in EU / Middle East / Southeast Asia
  • Hyperscaler fatigue among enterprise buyers
  • GDPR + AI Act enforcement making “EU data residency” a paid feature
  • Growth of smaller GPU clouds that need a control plane they don’t want to build

Threats

  • Hyperscaler undercutting on commodity inference
  • Proprietary control planes (Bedrock, Azure AI Studio) with deep AWS/Azure integration
  • Other OSS orchestration stacks (LiteLLM, OpenDevin) for lighter use cases
  • Providers building their own bespoke control plane instead of buying one

First entry points — concrete value in 30 / 90 / 180 days

30 days

Stand up the reference stack

Deploy alpha.3 on a single box. Wire your GPU fleet as backends. Whitelabel the webchat with your brand in one CSS patch. Test with 3 internal tenants.

90 days

First paying tenant

Onboard a friendly customer with API keys + quota. Usage metering into your billing system via /metrics. SLA draft, audit report scheduled monthly.
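
A hedged sketch of turning the /metrics feed into billing line items; the metric name (eldric_tokens_total) and the tenant label are assumptions for illustration, not documented metric names.

```python
# Sketch: turn per-tenant token counters scraped from Eldric's /metrics endpoint
# into billing line items. The metric name (eldric_tokens_total) and the "tenant"
# label are assumptions, not documented metric names.
import re
import requests

METRICS_URL = "https://api.yourbrand.ai/metrics"
PRICE_PER_MTOK_EUR = 0.90                       # your price per million tokens

# matches e.g.: eldric_tokens_total{tenant="acme-gmbh",model="llama-3.3-70b"} 1.2e+07
LINE = re.compile(r'^eldric_tokens_total\{([^}]*)\}\s+([0-9.eE+-]+)$')
TENANT = re.compile(r'tenant="([^"]*)"')

def tenant_usage() -> dict:
    """Sum token counters per tenant from a single /metrics scrape."""
    usage = {}
    for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        labels, value = m.groups()
        t = TENANT.search(labels)
        tenant = t.group(1) if t else "unknown"
        usage[tenant] = usage.get(tenant, 0.0) + float(value)
    return usage

if __name__ == "__main__":
    for tenant, tokens in sorted(tenant_usage().items()):
        print(f"{tenant}: {tokens / 1e6:.1f}M tokens -> "
              f"EUR {tokens / 1e6 * PRICE_PER_MTOK_EUR:.2f}")
```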

180 days

Commercial launch

Tiered pricing live. 5–10 tenants. Support rotation documented. Extensions marketplace for vertical connectors. Ready for the next fundraise conversation.

Install alpha.3 · Privacy-first HPC use case · Data access article · office@eldric.ai
#InferenceProviders #SovereignAI #Whitelabel #MultiTenant #OpenSourceInfra #GPUCloud