Eldric Router

Intelligent load balancing with AI-powered routing decisions

v4.1.0

Architecture Overview

The Router sits between Edge/Controller and Workers, distributing inference requests across the cluster using configurable strategies and optional AI-powered decision making.

Router Architecture & Request Flow
[Diagram: external clients (OpenWebUI, apps, APIs) connect to the Edge Server (port 443: TLS, auth, rate limiting), which forwards to the Controller (port 8880, worker sync) and the Router (eldric-routerd, port 8881: AI decision engine, SSE streaming, five load balancing strategies); the router distributes to inference workers running Ollama, vLLM, and TGI (port 8890, GPU), a Cloud Worker (port 8889: OpenAI, Anthropic, xAI), and a Data Worker (port 8892)]

Overview

The Eldric Router operates on Port 8881 and serves as the intelligent traffic distribution layer between the Edge Server or Controller and backend Workers. It supports five built-in load balancing strategies and optional AI-powered routing for context-aware worker selection.

Intelligent Distribution

Five load balancing strategies from simple round-robin to AI-powered autonomous routing with LLM-based decision making.

Worker Health Monitoring

Continuous health checks with automatic failover. Unhealthy workers are removed from rotation and re-added when recovered.

Zero-Copy Streaming

Server-Sent Events (SSE) proxy with zero-copy forwarding. OpenAI-compatible streaming from any backend through the router.

Controller Sync

Syncs worker registry from the Controller at configurable intervals. Can also operate standalone with manually configured workers.

Load Balancing Strategies

The router provides five built-in strategies for distributing requests across workers. The default strategy is load_based.

round_robin

Simple rotation through available workers. Each worker receives requests in sequence. Best for homogeneous worker pools with similar capabilities.

Default Fallback

least_connections

Routes to the worker with the fewest active requests. Naturally adapts to workers with different processing speeds. Good for mixed hardware environments.

Recommended

load_based

Routes based on reported worker load metrics (CPU, memory, GPU utilization). Workers report their load during health checks. The default strategy.

Default

latency_based

Tracks response times per worker and routes to the fastest. Adapts in real-time as latency changes. Ideal for geographically distributed clusters.

Performance

random

Random worker selection from the healthy pool. Provides natural distribution without tracking state. Useful for testing and simple deployments.

Basic
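The five strategies boil down to different selection functions over per-worker metrics. The sketch below is illustrative Python, not the router's actual C++ implementation; the worker fields and the `select()` helper are hypothetical.

```python
# Illustrative sketch of the five load balancing strategies.
# Worker records and the select() helper are hypothetical, not the real API.
import itertools
import random

workers = [
    {"name": "w1", "active": 8, "load": 0.25, "latency_ms": 245},
    {"name": "w2", "active": 3, "load": 0.78, "latency_ms": 89},
    {"name": "w3", "active": 1, "load": 0.95, "latency_ms": 12},
]

_rr = itertools.cycle(workers)  # round-robin cursor

def select(strategy):
    if strategy == "round_robin":        # sequential rotation
        return next(_rr)
    if strategy == "least_connections":  # fewest active requests
        return min(workers, key=lambda w: w["active"])
    if strategy == "load_based":         # lowest reported load (default)
        return min(workers, key=lambda w: w["load"])
    if strategy == "latency_based":      # fastest recent response time
        return min(workers, key=lambda w: w["latency_ms"])
    return random.choice(workers)        # random

print(select("least_connections")["name"])  # w3
print(select("load_based")["name"])         # w1
```

Note how `least_connections` and `load_based` can disagree: w3 has the fewest active requests but the highest reported load, which is why mixed hardware favors one strategy and production clusters the other.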

Swarm LLM & Ensemble Mode

Fan-out inference across multiple LLM workers with intelligent response synthesis. The router automatically selects the optimal strategy based on query content.

debate

Models argue different positions across multiple rounds. A judge model evaluates arguments and renders the final verdict. Best for decisions and architecture questions.

Multi-Round

critique

Model A generates a response, Model B critiques it, Model A revises. Iterates for configurable rounds. Best for writing, planning, and content refinement.

Iterative

best_of_n

Fan-out to N models in parallel, a judge picks the single best answer. Best for code generation where merging multiple outputs produces inconsistencies.

Parallel

vote

All models answer independently, consensus analysis identifies agreement and disagreement with confidence indicators. Best for factual and classification questions.

Consensus

synthesize

The default ensemble strategy. A synthesis model merges the best elements of every model's response into one comprehensive answer.

Default
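The best_of_n pattern above can be sketched as a parallel fan-out plus a judge. The model answers and the length-based judge here are stand-ins (the real judge is itself an LLM call, and `ask()` stands in for a /v1/chat/completions request to one worker).

```python
# Hedged sketch of best_of_n: N models answer in parallel, a judge picks one.
from concurrent.futures import ThreadPoolExecutor

def ask(model, prompt):
    # stand-in for a /v1/chat/completions call to one worker
    canned = {
        "llama3.2:3b": "def add(a, b): return a + b",
        "qwen2.5":     "def add(a, b):\n    return a + b  # with docstring",
        "mistral":     "lambda a, b: a + b",
    }
    return {"model": model, "answer": canned[model]}

def judge(candidates):
    # toy judge: prefer the longest (most complete) answer;
    # in the router the judge is an LLM evaluating the candidates
    return max(candidates, key=lambda c: len(c["answer"]))

def best_of_n(prompt, models):
    with ThreadPoolExecutor() as pool:            # fan-out in parallel
        candidates = list(pool.map(lambda m: ask(m, prompt), models))
    return judge(candidates)                      # single best answer

winner = best_of_n("write add()", ["llama3.2:3b", "qwen2.5", "mistral"])
print(winner["model"])
```

Returning one intact answer instead of merging is exactly why best_of_n suits code generation: the winning snippet stays internally consistent.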

xLSTM Predictor

Optional integration with xLSTM (the extended LSTM architecture from Sepp Hochreiter's group) for workload forecasting, anomaly detection, and fast sequence classification. Enables proactive scaling decisions before load spikes hit.

Load Balancing Strategy Comparison
[Diagram: the five strategies compared side by side. round_robin: sequential rotation 1 → 2 → 3 → 4 → 1, best for homogeneous worker pools. least_connections: fewest active requests, best for mixed hardware environments. load_based (default): lowest CPU/GPU/memory utilization, best overall for production clusters. latency_based: fastest response time, best for geo-distributed clusters. random: random pick from the healthy pool, best for testing and simple deployments.]
# Configure load balancing strategy via Controller API
curl -X POST http://controller:8880/api/v1/router/config \
  -H "Content-Type: application/json" \
  -d '{"strategy": "least_connections"}'

# Or configure directly on the router
curl -X POST http://router:8881/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"strategy": "latency_based"}'

AI-Based Routing

Eldric supports intelligent AI-controlled routing where an LLM makes real-time worker selection decisions based on request context, worker capabilities, and cluster state.

AI Control Modes

none

AI routing disabled. Uses algorithmic load balancing only. Lowest latency overhead.

Default
advisory

AI suggests a worker but the system only logs the suggestion. Useful for testing AI routing before enabling it.

Evaluation
autonomous

AI makes the routing decision. The LLM evaluates worker load, latency, and capabilities to select the optimal target.

Enterprise

Router LLM Model

AI routing uses a dedicated Ollama model for decision making. Eldric includes a custom-trained routing model optimized for fast, accurate worker selection. Any Ollama-compatible model can be used, but smaller models (1B-3B parameters) are recommended to minimize routing latency.

Decision Flow

AI Routing Decision Pipeline
[Diagram: a request to /v1/chat/completions arrives; workers are filtered by the requested model (e.g. "llama3.2:3b"); target metrics are gathered (load, latency, capabilities, GPU utilization, active connections); if AI routing is enabled, the LLM analyzes the targets and selects a worker, otherwise (or if the LLM call fails) the algorithmic fallback applies (load_based, then round_robin); the request is proxied to the selected worker and the response includes a "routing.reason" field]

Enable AI Routing

# Enable autonomous AI routing
curl -X POST http://router:8881/api/v1/ai/configure \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "autonomous",
    "llm_model": "llama3.2:3b",
    "ollama_url": "http://localhost:11434"
  }'

# Check AI routing status
curl http://router:8881/api/v1/ai/status

Response with Routing Info

When AI routing is active, responses include routing metadata explaining the decision.

{
  "worker": "wrk-abc123",
  "worker_host": "10.3.7.20",
  "routing": {
    "strategy": "ai_routing",
    "reason": "Low latency and capability match for code generation tasks"
  }
}

Knowledge Routing

The router supports content-aware knowledge routing with 150+ predefined themes. Incoming requests are analyzed and routed to workers specialized in the relevant domain, from scientific computing to creative writing to code generation.

How Knowledge Routing Works

The router classifies incoming prompts by topic and matches them to workers configured with domain expertise. A request about molecular biology routes to a worker running a science-tuned model, while a coding question routes to a worker with a code-optimized model.

  • 150+ predefined knowledge themes across science, engineering, humanities, and more
  • Automatic prompt classification using keyword and semantic analysis
  • Worker capability tags for domain specialization
  • Fallback to standard load balancing when no specialization match is found
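The classify-then-match flow with fallback can be sketched as follows. The themes, keywords, and worker tags here are illustrative stand-ins for the router's 150+ theme catalog and semantic analysis.

```python
# Minimal sketch of keyword-based theme classification with fallback.
# THEMES and worker tags are illustrative, not the shipped catalog.
THEMES = {
    "code":    {"function", "compile", "bug", "python", "refactor"},
    "science": {"molecule", "protein", "quantum", "enzyme"},
}

workers = [
    {"name": "w-code", "tags": {"code"}},
    {"name": "w-sci",  "tags": {"science"}},
]

def classify(prompt):
    words = set(prompt.lower().split())
    for theme, keywords in THEMES.items():
        if words & keywords:              # any keyword hit claims the theme
            return theme
    return None                           # no specialization match

def route(prompt):
    theme = classify(prompt)
    for w in workers:
        if theme in w["tags"]:
            return w["name"]
    return "load_balanced"                # fallback to standard load balancing

print(route("fix this python bug"))          # w-code
print(route("what is the capital of peru"))  # load_balanced
```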

For full details, see the Knowledge Routing documentation.

Worker Health Monitoring

The router continuously monitors worker health and automatically manages the active worker pool.

Periodic Health Checks

The router polls each worker's /health endpoint at a configurable interval (default: 30 seconds). Workers report their status, load metrics, available models, and GPU utilization.

Automatic Failover

When a worker fails health checks, it is removed from the active rotation. Requests are automatically redistributed to remaining healthy workers with no client impact.

Recovery Detection

Unhealthy workers continue to be checked. When they recover, they are automatically re-added to the active pool. No manual intervention required.
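The check/failover/recovery cycle described above amounts to one loop over the pool, sketched here with a stubbed probe in place of the real HTTP GET to each worker's /health endpoint.

```python
# Sketch of one health-check cycle: drop failing workers from rotation,
# re-add them when they recover. probe() stands in for an HTTP GET /health.
def probe(worker):
    # assumption: a real probe does an HTTP request and parses the status
    return worker["simulated_healthy"]

def run_health_cycle(pool):
    for w in pool:
        healthy = probe(w)
        if w["active"] and not healthy:
            w["active"] = False          # failover: remove from rotation
        elif not w["active"] and healthy:
            w["active"] = True           # recovery: re-add automatically

pool = [
    {"url": "http://10.3.7.47:8890",  "active": True,  "simulated_healthy": False},
    {"url": "http://10.19.0.12:8890", "active": False, "simulated_healthy": True},
]
run_health_cycle(pool)
print([w["active"] for w in pool])  # [False, True]
```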

Standalone Router Daemon

The router can run as a standalone daemon (eldric-routerd) that syncs workers from the Controller and operates independently.

Router Mode Features

  • Syncs worker registry from the Controller at configurable intervals
  • Operates independently even if the Controller goes offline temporarily
  • Supports multiple router instances for high availability
  • Maintains its own health check cycle for all known workers
  • Configurable sync interval (default: 30 seconds)
# Run standalone router daemon
./eldric-routerd --port 8881 \
  --controller http://controller:8880 \
  --sync-interval 30000

# Router with AI routing enabled
./eldric-routerd --port 8881 \
  --controller http://controller:8880 \
  --ai-mode autonomous \
  --ai-model llama3.2:3b \
  --ollama-url http://localhost:11434

# Router with specific strategy
./eldric-routerd --port 8881 \
  --controller http://controller:8880 \
  --strategy latency_based

Streaming Support

The router provides zero-copy SSE (Server-Sent Events) proxy, forwarding streaming responses from workers to clients in real-time with OpenAI-compatible format.

Zero-Copy Streaming Pipeline
[Diagram: request flow Client → Edge (TLS + auth) → Router (load balancing, zero-copy forward) → Worker → backend (Ollama / vLLM / TGI); SSE delta chunks stream back through each hop unchanged, ending with data: [DONE]]
# Streaming chat through the router (OpenAI-compatible)
curl -X POST http://router:8881/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# SSE response format
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: {"choices":[{"delta":{"content":" How"}}]}
data: [DONE]
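On the client side, assembling the streamed tokens from SSE lines in this format looks roughly like the sketch below; a real client reads the lines incrementally from the HTTP response body.

```python
# Sketch: collect token deltas from OpenAI-style SSE lines until [DONE].
import json

def collect_sse(lines):
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blank/keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":           # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

stream = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":"!"}}]}',
    'data: {"choices":[{"delta":{"content":" How"}}]}',
    'data: [DONE]',
]
print(collect_sse(stream))  # Hello! How
```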

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Basic health check |
| /api/v1/health | GET | Detailed health status with worker counts and uptime |
| /v1/chat/completions | POST | OpenAI-compatible chat completions (proxied to workers); supports streaming via SSE |
| /v1/models | GET | List available models across all workers (aggregated, deduplicated) |
| /api/v1/ai/configure | POST | Configure AI routing mode, LLM model, and Ollama URL |
| /api/v1/ai/status | GET | Get current AI routing configuration and statistics |
| /api/v1/workers | GET | List workers known to this router (synced from Controller) |
| /api/v1/data/query | POST | Proxied to the Data Worker for database queries |

Configuration

The router can be configured via a JSON configuration file or command-line arguments.

// router.json
{
  "port": 8881,
  "bind_address": "0.0.0.0",
  "controller_url": "http://controller:8880",
  "sync_interval_ms": 30000,
  "health_check_interval_ms": 30000,
  "strategy": "load_based",
  "ai_routing": {
    "mode": "none",
    "llm_model": "llama3.2:3b",
    "ollama_url": "http://localhost:11434"
  },
  "workers": [
    { "url": "http://10.3.7.47:8890", "tags": ["gpu", "inference"] },
    { "url": "http://10.19.0.12:8890", "tags": ["gpu", "inference"] }
  ]
}
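Assuming the schema above, a tool could load and sanity-check a router config like this; the validation rules here are illustrative, not the daemon's actual checks.

```python
# Sketch: load a router.json-style config and validate a few fields.
# Field names follow the sample config; the checks are assumptions.
import json

VALID_STRATEGIES = {"round_robin", "least_connections", "load_based",
                    "latency_based", "random"}

def load_config(text):
    cfg = json.loads(text)
    if cfg.get("strategy", "load_based") not in VALID_STRATEGIES:
        raise ValueError("unknown strategy")
    if not 1 <= cfg.get("port", 8881) <= 65535:
        raise ValueError("port out of range")
    mode = cfg.get("ai_routing", {}).get("mode", "none")
    if mode not in {"none", "advisory", "autonomous"}:
        raise ValueError("unknown ai_routing mode")
    return cfg

cfg = load_config('''{
  "port": 8881,
  "strategy": "load_based",
  "controller_url": "http://controller:8880",
  "ai_routing": {"mode": "none", "llm_model": "llama3.2:3b"}
}''')
print(cfg["strategy"])  # load_based
```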

CLI Usage

# Start router with controller sync
./eldric-routerd --port 8881 --controller http://controller:8880

# Start with custom sync interval (60 seconds)
./eldric-routerd --port 8881 --controller http://controller:8880 --sync-interval 60000

# Start with specific load balancing strategy
./eldric-routerd --port 8881 --controller http://controller:8880 --strategy least_connections

# Start with AI routing enabled
./eldric-routerd --port 8881 \
  --controller http://controller:8880 \
  --ai-mode autonomous \
  --ai-model llama3.2:3b \
  --ollama-url http://localhost:11434

# Start with manually specified workers (no controller)
./eldric-routerd --port 8881 \
  --workers http://10.3.7.47:8890,http://10.19.0.12:8890,http://10.19.0.13:8890

# Start from config file
./eldric-routerd --config /etc/eldric/router.json

# Build the router daemon
cd cpp/build
cmake -DBUILD_DISTRIBUTED=ON ..
make eldric-routerd

Quick Start

Get a router running in under a minute.

Step 1: Build

cd cpp/build
cmake -DBUILD_DISTRIBUTED=ON ..
make eldric-routerd

Step 2: Start the Router

# Connect to an existing controller
./eldric-routerd --port 8881 --controller http://localhost:8880

Step 3: Send a Request

curl -X POST http://localhost:8881/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Port Reference

| Component | Port | Protocol | Description |
| --- | --- | --- | --- |
| Router | 8881 | HTTP/REST | Load balancing and request routing |
| Edge Server | 443 | HTTPS | TLS termination and authentication |
| Controller | 8880 | HTTP/REST | Cluster management and worker registry |
| Worker | 8890 | HTTP/REST | AI inference via backend (Ollama, vLLM, etc.) |
| Cloud Worker | 8889 | HTTP/REST | Multi-backend cloud inference gateway |
| Data Worker | 8892 | HTTP/REST | Database queries proxied via the router |