Multi-backend cloud inference gateway. Route AI requests across xAI, OpenAI, Anthropic, DeepSeek, Groq, and more — from a single unified OpenAI-compatible endpoint.
Port 8889 · eldric-cloudd
The Cloud Worker (eldric-cloudd) is a multi-backend cloud inference gateway running on Port 8889. It manages multiple cloud API backends simultaneously, routing requests to the correct provider based on the requested model name. Unlike the standard inference worker (which proxies a single local backend like Ollama), the Cloud Worker aggregates dozens of cloud APIs behind one unified endpoint.
Connect to xAI, OpenAI, Anthropic, DeepSeek, Groq, Together AI, Fireworks, Mistral, Cohere, and any OpenAI-compatible endpoint — all through a single gateway.
Requests are automatically routed to the correct backend based on the model name. Ask for grok-3 and traffic goes to xAI; ask for claude-3.5-sonnet and it goes to Anthropic.
Exposes a standard /v1/chat/completions endpoint with full streaming SSE support. Drop-in replacement for any OpenAI client library.
Continuous 60-second health checks on all connected backends. Unhealthy backends are automatically excluded from routing until they recover.
Priority-based routing with automatic fallback to a default backend when the requested model is not found on any provider.
Deploy multiple Cloud Worker instances. Each registers independently with the controller using a cld- ID prefix, enabling load distribution across gateways.
The Cloud Worker sits within the Eldric distributed stack alongside regular inference workers. The controller and router treat it as a specialized worker with worker_type: "cloud".
Request flow:
1. Request arrives at /v1/chat/completions with model: "grok-3"
2. Model lookup & backend resolution
3. Forwarded to the matching backend with SSE streaming
4. Response streamed back to the caller
The Cloud Worker supports all major cloud LLM providers and any endpoint following the OpenAI API specification.
| Backend | API Endpoint | Example Models | Streaming | Tool Calling |
|---|---|---|---|---|
| Ollama Cloud | http://remote:11434 | llama3.2, qwen3, mistral, gemma3 | Yes | Yes |
| xAI / Grok | https://api.x.ai/v1 | grok-3, grok-3-mini, grok-2 | Yes | Yes |
| OpenAI | https://api.openai.com/v1 | gpt-4o, gpt-4-turbo, gpt-4, o1, o3 | Yes | Yes |
| Anthropic | https://api.anthropic.com/v1 | claude-sonnet-4-20250514, claude-3.5-sonnet, claude-3-opus | Yes | Yes |
| DeepSeek | https://api.deepseek.com/v1 | deepseek-chat, deepseek-reasoner | Yes | Yes |
| Groq | https://api.groq.com/openai/v1 | llama-3.3-70b, mixtral-8x7b, gemma2-9b | Yes | Yes |
| Together AI | https://api.together.xyz/v1 | meta-llama/Llama-3.3-70B, Qwen/Qwen2.5-72B | Yes | Yes |
| Fireworks AI | https://api.fireworks.ai/inference/v1 | llama-v3p3-70b, mixtral-8x22b | Yes | Yes |
| Mistral AI | https://api.mistral.ai/v1 | mistral-large, mistral-medium, codestral | Yes | Yes |
| Cohere | https://api.cohere.ai/v1 | command-r-plus, command-r, embed-english | Yes | Yes |
| OpenAI-Compatible | Any URL | Any model following OpenAI spec | Yes | Varies |
Provider highlights:
- xAI / Grok: frontier reasoning with Grok-3
- OpenAI: GPT-4o, o1, o3 reasoning
- Anthropic: Claude Sonnet, Opus
- DeepSeek: cost-effective reasoning
- Groq: ultra-fast inference
- Together AI: open model hosting
- Fireworks AI: optimized open models
- Mistral AI: European AI provider
- Cohere: enterprise NLP & RAG
- Ollama Cloud: remote Ollama instances
- OpenAI-Compatible: any OpenAI-compatible API

The Cloud Worker automatically discovers available models from all connected backends at startup and refreshes periodically. When a new backend is added, its models are queried and merged into the unified model catalog.
All models from all backends are aggregated into a single catalog, served via GET /v1/models. Each model entry includes its source backend, so clients can see every available model in one call.
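The aggregation step can be pictured as a simple merge into OpenAI-style model entries. This is an illustrative Python sketch, not the worker's actual code; the `owned_by` field carrying the source backend name is an assumption based on the description above.

```python
def build_catalog(backend_models):
    """Merge per-backend model lists into one unified catalog.

    backend_models: dict mapping backend name -> list of model names.
    The input shape is hypothetical; the real worker builds it from
    auto-discovery against each backend's models endpoint.
    """
    catalog = []
    for backend, models in backend_models.items():
        for model in models:
            catalog.append({
                "id": model,
                "object": "model",
                "owned_by": backend,  # source backend attribution
            })
    return catalog

catalog = build_catalog({
    "xai": ["grok-3", "grok-3-mini"],
    "anthropic": ["claude-3.5-sonnet"],
})
```

A client calling GET /v1/models would then see every model in one response, each tagged with the backend it came from.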
Backends can be configured statically via CLI flags or configuration file, and added dynamically via the API at runtime without restarting the worker.
When a chat completion request arrives, the Cloud Worker resolves the model name to the correct backend. If multiple backends serve the same model, priority determines which backend handles the request. If the model is not found on any backend, the request falls back to the configured default backend.
Each request specifies a model name. The Cloud Worker maintains a model-to-backend mapping built from auto-discovery; the highest-priority backend that owns the model receives the request.
A default backend can be configured for requests where the model is not found. This ensures requests always have somewhere to go, even with unknown model names.
Only healthy backends are considered for routing. If a backend fails health checks, its models are temporarily excluded until the backend recovers.
If model resolution fails (model not found on any backend), the request is forwarded to the default backend. If the resolved backend is unhealthy, the request returns an error with backend status details.
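The resolution order described above (healthy backends only, priority tie-breaking, default fallback) can be sketched as follows. The backend record fields (`name`, `models`, `priority`, `healthy`) are illustrative assumptions, not the worker's real data model.

```python
def resolve_backend(model, backends, default=None):
    """Pick the backend name that should handle a requested model."""
    # Only healthy backends that actually serve the model are candidates.
    candidates = [b for b in backends if b["healthy"] and model in b["models"]]
    if candidates:
        # Lower priority value wins when several backends serve the model.
        return min(candidates, key=lambda b: b["priority"])["name"]
    # Unknown model: fall back to the configured default backend.
    return default

backends = [
    {"name": "xai", "models": {"grok-3"}, "priority": 1, "healthy": True},
    {"name": "groq", "models": {"llama-3.3-70b"}, "priority": 2, "healthy": True},
]
```

For example, `resolve_backend("grok-3", backends, default="groq")` routes to xAI, while an unknown model name falls through to the default.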
The Cloud Worker continuously monitors the health of all connected backends at a 60-second interval. Health checks verify API reachability, authentication validity, and model availability.
Every 60 seconds, each backend receives a lightweight health probe. Response time and status are recorded for routing decisions.
Backends that fail health checks are marked unhealthy and excluded from routing. They are automatically re-included when checks pass again.
Backend health status, latency, and model counts are displayed on the web dashboard at /dashboard for at-a-glance monitoring.
Health checks also verify that API keys are valid and not expired, alerting before requests start failing due to authentication issues.
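A minimal sketch of the bookkeeping this implies, with invented field names; the real worker also records latency and key validity alongside the healthy/unhealthy flag.

```python
class BackendHealth:
    """Track per-backend health from periodic probes (illustrative sketch)."""

    def __init__(self):
        # backend name -> result of its most recent probe
        self.status = {}

    def record_probe(self, name, ok, latency_ms):
        self.status[name] = {"healthy": ok, "latency_ms": latency_ms}

    def routable(self):
        # Only backends whose last probe succeeded are eligible for routing.
        return [name for name, s in self.status.items() if s["healthy"]]

health = BackendHealth()
health.record_probe("openai", True, 120.0)
health.record_probe("groq", False, 0.0)
```

After the failed probe, "groq" drops out of the routable set until a later probe succeeds again.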
Deploy multiple Cloud Worker instances for high availability and load distribution. Each instance registers independently with the controller, and the router distributes requests across all healthy cloud workers using standard load-balancing strategies.
Each Cloud Worker instance registers with the controller using a unique cld- prefixed ID. The controller tracks all instances and their backend configurations.
The router distributes requests across cloud workers using standard strategies: round-robin, least-connections, load-based, or latency-based.
Each instance can connect to different sets of backends. Instance #1 might connect to xAI and Groq, while Instance #2 handles OpenAI and Anthropic.
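Round-robin, the simplest of the listed strategies, can be sketched in a few lines; the worker IDs below are made-up examples of the `cld-` prefixed form.

```python
import itertools

def round_robin(workers):
    """Cycle through healthy cloud workers, one request per turn."""
    return itertools.cycle(workers)

picker = round_robin(["cld-a1b2c3d4", "cld-e5f6a7b8"])
first, second, third = next(picker), next(picker), next(picker)
```

The third request wraps back to the first worker, so load spreads evenly regardless of how many requests arrive.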
The Cloud Worker registers with the controller as a specialized worker type. This integration enables centralized monitoring, routing, and management alongside regular inference workers.
| Property | Value | Description |
|---|---|---|
| worker_type | "cloud" | Distinguishes from regular inference workers ("inference") |
| ID Prefix | cld- | All cloud worker IDs start with cld- (e.g., cld-a1b2c3d4) |
| Default Port | 8889 | Avoids conflicts with regular workers (8890) and inference backends |
| Heartbeat | 30s interval | Reports health, active connections, and backend status to controller |
| Model Reporting | Aggregated list | Reports all models from all connected backends to the controller |
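Putting the table together, a registration/heartbeat payload might look roughly like the sketch below. Every field name here is an illustrative guess; only the `cld-` ID prefix, `worker_type: "cloud"`, port 8889, and the aggregated model list come from the table above.

```python
import uuid

def make_heartbeat(backend_models):
    """Assemble a heartbeat-style payload (field names are guesses)."""
    return {
        "worker_id": "cld-" + uuid.uuid4().hex[:8],  # cld- prefixed ID
        "worker_type": "cloud",
        "port": 8889,
        # Aggregated, de-duplicated model list across all backends.
        "models": sorted({m for ms in backend_models.values() for m in ms}),
    }

hb = make_heartbeat({"xai": ["grok-3"], "deepseek": ["deepseek-chat"]})
```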
The Cloud Worker exposes an OpenAI-compatible REST API with full streaming SSE support. Any client that works with the OpenAI API — including OpenWebUI, LangChain, LlamaIndex, and the Eldric CLI/GUI — can connect directly.
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check with backend status summary |
| /dashboard | GET | Web-based monitoring dashboard |
| /v1/models | GET | List all models from all connected backends |
| /v1/chat/completions | POST | Chat completion (streaming and non-streaming) |
| /api/v1/backends | GET | List connected backends with health status |
| /api/v1/backends | POST | Add a new backend at runtime |
| /api/v1/backends/{id} | DELETE | Remove a backend |
| /api/v1/backends/{id}/test | POST | Test backend connectivity |
Set "stream": true in chat completion requests for real-time token streaming via Server-Sent Events. The Cloud Worker proxies the SSE stream directly from the cloud backend with zero-copy forwarding.
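On the client side, consuming such a stream means reading `data:` lines and extracting the content deltas. A minimal parser sketch, assuming the standard OpenAI streaming chunk shape:

```python
import json

def parse_sse_chunks(raw):
    """Extract content tokens from an OpenAI-style SSE stream (sketch)."""
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, event names, and blank keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

stream = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    'data: [DONE]\n'
)
tokens = parse_sse_chunks(stream)
```

In practice the official OpenAI client libraries do this parsing for you; the sketch only shows what travels over the wire.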
Access the monitoring dashboard at http://cloud-worker:8889/dashboard for real-time visibility into backend status, model availability, request routing, and performance metrics.
View health status, response latency, and uptime for each connected cloud backend. Color-coded indicators for quick assessment.
Browse all discovered models across all backends, with source attribution and availability status.
Real-time request counts, routing decisions, and error rates per backend. Track which backends handle the most traffic.
View and manage backend configurations, API keys, default backend selection, and health check intervals.
Start the Cloud Worker with the eldric-cloudd binary. Configure backends via command-line flags or a JSON configuration file.
| Option | Default | Description |
|---|---|---|
| --port | 8889 | Listen port for the Cloud Worker API |
| --controller | — | Controller URL for registration and heartbeat |
| --backends | — | Comma-separated backend list: name:url:apikey |
| --default-backend | First backend | Backend name to use when model is not found |
| --config | — | Path to JSON configuration file |
| --health-interval | 60000 | Health check interval in milliseconds |
For complex setups, use a JSON configuration file instead of command-line flags. This allows per-backend priority, custom headers, and other advanced settings.
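A configuration file might look something like the fragment below. The field names and `${VAR}` placeholders are illustrative assumptions, not the worker's documented schema; consult the actual configuration reference before use.

```json
{
  "port": 8889,
  "default_backend": "openai",
  "backends": [
    {"name": "openai", "url": "https://api.openai.com/v1", "api_key": "${OPENAI_API_KEY}", "priority": 1},
    {"name": "groq", "url": "https://api.groq.com/openai/v1", "api_key": "${GROQ_API_KEY}", "priority": 2}
  ]
}
```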
API keys can be provided via the configuration file, command-line flags, or environment variables. For production deployments, environment variables are recommended to avoid storing keys in configuration files.
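Parsing the name:url:apikey triples is slightly subtle because the URL itself contains colons. A sketch of one way to handle it; the `$VAR` environment-variable indirection shown here is a hypothetical convention for keeping keys out of flags, not a documented feature.

```python
import os

def parse_backends(spec):
    """Parse a comma-separated list of name:url:apikey triples."""
    backends = []
    for item in spec.split(","):
        # The URL contains ':' too, so take the name from the left
        # and the key from the right.
        name, rest = item.split(":", 1)
        url, _, key = rest.rpartition(":")
        if key.startswith("$"):  # hypothetical env-var indirection
            key = os.environ.get(key[1:], "")
        backends.append({"name": name, "url": url, "api_key": key})
    return backends

parsed = parse_backends("xai:https://api.x.ai/v1:$XAI_API_KEY")
```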
The Cloud Worker bridges local infrastructure with cloud AI providers. Here are the primary deployment scenarios.
Run most traffic through local Ollama workers for privacy and cost control, while accessing frontier cloud models (GPT-4o, Claude, Grok-3) for complex tasks that require stronger reasoning. The router directs requests to local workers or the cloud worker based on the requested model.
Configure multiple backends for the same model family. If your primary provider (e.g., OpenAI) experiences downtime, requests automatically fail over to a secondary provider (e.g., Together AI running the same open model).
Route simple queries to low-cost providers (DeepSeek, Groq) while reserving expensive frontier models (GPT-4o, Claude Opus) for complex reasoning tasks. Combined with AI-powered routing on the router, this can be fully automated.
Give your entire organization access to the latest frontier models from every major provider through a single, centralized gateway. No need for individual API keys per user or application — the Cloud Worker manages authentication centrally.
Both worker types register with the controller and receive requests via the router, but they fill different roles.
| Aspect | Inference Worker (eldric-workerd) | Cloud Worker (eldric-cloudd) |
|---|---|---|
| Port | 8890 | 8889 |
| Worker Type | inference | cloud |
| ID Prefix | wrk- | cld- |
| Backend Count | Single backend (Ollama, vLLM, etc.) | Multiple backends simultaneously |
| Backend Location | Typically local / same network | Cloud APIs (remote endpoints) |
| Model Discovery | From single backend | Aggregated from all backends |
| Routing | Direct proxy to backend | Model-based routing to correct backend |
| Use Case | Local GPU inference | Cloud API aggregation & routing |
Get up and running with the Cloud Worker in minutes.
| Feature | Free | Standard | Professional | Enterprise |
|---|---|---|---|---|
| Cloud Workers | 1 | 2 | 5 | Unlimited |
| Backends per Worker | 2 | 5 | 15 | Unlimited |
| Auto-Discovery | Yes | Yes | Yes | Yes |
| Priority Routing | — | Yes | Yes | Yes |
| Health Monitoring | Yes | Yes | Yes | Yes |
| Dashboard | — | — | Yes | Yes |
| Runtime Backend Add/Remove | — | Yes | Yes | Yes |
Contact license@core.at for enterprise licensing. License files are managed via the license server.