Model Management

Pull, distribute, and manage AI models across your Eldric cluster

v4.1.0

Overview

Eldric provides a unified model management system that lets you pull models from public registries, host your own custom models, and distribute them to every inference worker in your cluster — all from a single API or the controller dashboard.

Ollama Registry

Pull models directly from the Ollama Hub to all inference workers in parallel. Supports any model available in the Ollama registry — Llama, Qwen, Mistral, Gemma, DeepSeek, and thousands more.

  • Async pull with job tracking
  • Parallel download to all workers
  • Per-worker progress monitoring
  • Automatic model verification

Custom Model Registry

Host your own models on the Data Worker. Upload GGUF, safetensors, or training output from the Training Worker and make them available across the cluster.

  • GGUF and safetensors support
  • Training Worker integration
  • Version tracking and metadata
  • Multi-tenant isolation

Worker-to-Worker Distribution

Distribute models from any source — registry, another worker, a URL, or Ollama — to specific workers or the entire cluster with a single API call.

  • NFS path optimization (zero-copy)
  • Backend-specific installation
  • Coverage tracking per model
  • Selective or cluster-wide targeting

Architecture Flow

Model Distribution Architecture
[Diagram: Model Distribution Architecture. Three model sources feed the Controller (:8880), which handles orchestration, job scheduling, and tracking: the Ollama Hub (ollama.com/library), the Custom Registry on the Data Worker (:8892), and external URLs (HuggingFace, S3, etc.). The controller distributes to the inference workers: Worker 1 (Ollama), Worker 2 (vLLM), and Worker 3 (llama.cpp), all on :8890, plus the Cloud Worker (:8889). Non-inference workers are skipped: Data (:8892), Science (:8897), Media (:8894), and Comm (:8895).]

Pulling Models from Ollama

The controller provides a "Pull to All Workers" feature that triggers an asynchronous model pull across every inference worker in the cluster. Each worker pulls directly from the Ollama registry, and the controller tracks progress per worker.

How It Works

1. API request to controller
2. Controller creates pull job
3. Parallel pull to all workers
4. Per-worker progress tracking

Async Pull Job Flow

[Diagram: the dashboard sends POST /models/pull ("Pull to All Workers"); the controller creates a job and returns a job_id immediately, then pulls to all workers in parallel in the background (for example Worker 1 at 100%, Worker 2 at 67%, Worker 3 at 45%). The dashboard polls GET /pull-jobs/{id} every 2 seconds for live status updates. Per-worker state machine: Pending, Pulling (n%), then Completed or Error. When all workers have completed, the job status becomes "completed".]
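The dashboard's 2-second polling loop can be sketched client-side. This is a minimal illustration, not a shipped client: the wait_for_pull helper and its fetch_status callback are hypothetical names, and in practice fetch_status would GET /api/v1/models/pull-jobs/{job_id} on the controller and parse the JSON.

```python
import time

def wait_for_pull(job_id, fetch_status, interval=2.0, timeout=3600):
    """Poll a pull job until it leaves the 'running' state or times out.

    `fetch_status` is injected so it can be stubbed in tests; a real client
    would issue the HTTP GET inside it.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(job_id)
        if job["status"] != "running":
            return job
        time.sleep(interval)
    raise TimeoutError(f"pull job {job_id} still running after {timeout}s")
```

Returning the final job document (rather than just a boolean) lets the caller inspect per-worker results after a partial failure.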

Pull a Model to All Workers

```shell
# Pull a model to every inference worker in the cluster
curl -X POST http://controller:8880/api/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "all": true
  }'

# Response: a job ID for tracking
{
  "job_id": "pull-a1b2c3d4",
  "model": "llama3.2:3b",
  "status": "running",
  "workers": 3,
  "started_at": "2026-03-12T10:30:00Z"
}
```

Track Pull Progress

```shell
# Check the status of a pull job
curl http://controller:8880/api/v1/models/pull-jobs/pull-a1b2c3d4

# Response: per-worker status
{
  "job_id": "pull-a1b2c3d4",
  "model": "llama3.2:3b",
  "status": "running",
  "workers": {
    "wrk-worker1": { "status": "completed", "progress": 100 },
    "wrk-worker2": { "status": "pulling", "progress": 67 },
    "wrk-worker3": { "status": "pulling", "progress": 45 }
  }
}
```
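As a sketch of how the per-worker states might roll up into the job-level status field, assuming the documented states (pending, pulling, completed, error); the helper names are illustrative, not part of the API.

```python
def job_status(workers):
    """Derive a job-level status from per-worker pull states."""
    states = {w["status"] for w in workers.values()}
    if "error" in states:
        return "failed"
    if states <= {"completed"}:
        return "completed"
    return "running"

def overall_progress(workers):
    """Average per-worker progress, rounded to the nearest percent."""
    if not workers:
        return 0
    return round(sum(w.get("progress", 0) for w in workers.values()) / len(workers))
```

With the example response above, the job stays "running" until the slowest worker finishes, and overall progress averages the three workers.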

List All Pull Jobs

```shell
# List all active and recent pull jobs
curl http://controller:8880/api/v1/models/pull-jobs

# Returns an array of pull job statuses
```

Custom Model Registry

The Custom Model Registry is hosted on the Data Worker and stores models that you upload manually, export from the Training Worker, or download from external sources. Registered models can be distributed to inference workers on demand.

GGUF Models

Quantized models for llama.cpp and Ollama. Efficient storage and fast inference on CPU and GPU.

Safetensors

Standard format for vLLM, TGI, and Triton backends. Full-precision or quantized weights.

Training Output

LoRA adapters and merged models from the Training Worker are automatically registered here.

Imported Models

Models downloaded from HuggingFace, custom URLs, or transferred from other clusters.

Registry Upload & Distribution Flow
[Diagram: Step 1, register metadata: the client/CLI POSTs metadata to the Controller (:8880), which proxies it to the model registry on the Data Worker (:8892), where the metadata is stored. Step 2, upload model file: the client uploads the binary directly to the Data Worker as application/octet-stream. Step 3, distribute to cluster: the Controller fetches the model from the registry and installs it per backend: ollama create from GGUF on Worker 1, the vLLM model directory on Worker 2, and the llama.cpp models path on Worker 3. NFS optimization: if workers share NFS with the Data Worker, models are served via path (zero-copy) and no transfer is needed.]

Register a Model

```shell
# Register a new model in the registry
curl -X POST http://controller:8880/api/v1/model-registry/upload \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-finetuned-llama",
    "format": "gguf",
    "size_bytes": 4200000000,
    "description": "Fine-tuned Llama 3.2 3B for customer support",
    "source": "training-worker",
    "metadata": {
      "base_model": "llama3.2:3b",
      "method": "lora",
      "epochs": 5
    }
  }'
```

Upload Model File

```shell
# Upload the model file to the Data Worker
curl -X POST http://dataworker:8892/api/v1/models/my-finetuned-llama/upload \
  -H "Content-Type: application/octet-stream" \
  --data-binary @my-model.gguf
```

List and Delete Registry Models

```shell
# List all registered models
curl http://controller:8880/api/v1/model-registry

# Delete a model from the registry
curl -X DELETE http://controller:8880/api/v1/model-registry/my-finetuned-llama
```

Model Distribution

The distribution system takes a model from any source and installs it on target workers, handling backend-specific installation automatically. If workers share NFS storage with the Data Worker, models are accessed via path rather than copied.

Distribution Sources

| Source | Description | Use When |
|---|---|---|
| registry | Fetch from the Custom Model Registry on the Data Worker | Distributing custom or fine-tuned models |
| worker | Copy from one worker to another | Replicating a model already on one node |
| url | Download from an external URL (HuggingFace, S3, etc.) | Pulling from external model hosting |
| ollama | Pull from the Ollama registry | Using standard Ollama models |
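A small client-side helper can keep these source values straight before calling the API. The function name and up-front validation are illustrative assumptions; the controller performs its own validation on the request body.

```python
# The four source values documented for POST /api/v1/models/distribute.
VALID_SOURCES = {"registry", "worker", "url", "ollama"}

def distribute_request(model, source, target_workers="all"):
    """Build the JSON body for a distribute call, rejecting unknown sources."""
    if source not in VALID_SOURCES:
        raise ValueError(
            f"unknown source {source!r}; expected one of {sorted(VALID_SOURCES)}"
        )
    return {"model": model, "source": source, "target_workers": target_workers}
```

target_workers accepts either the string "all" or an explicit list of worker IDs, mirroring the two curl examples below.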

Distribute a Model

```shell
# Distribute a custom model from the registry to all workers
curl -X POST http://controller:8880/api/v1/models/distribute \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-finetuned-llama",
    "source": "registry",
    "target_workers": "all"
  }'

# Distribute to specific workers only
curl -X POST http://controller:8880/api/v1/models/distribute \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-finetuned-llama",
    "source": "registry",
    "target_workers": ["wrk-worker1", "wrk-worker2"]
  }'
```

Backend-Specific Installation

| Backend | Installation Method | Notes |
|---|---|---|
| Ollama | Generated Modelfile + ollama create | Creates from GGUF, auto-generates template |
| vLLM | Copy to model directory | Safetensors or GGUF placed in serving path |
| llama.cpp | Copy to model directory | GGUF files placed in configured models path |
| Triton | Model repository + load API | config.pbtxt generated, model loaded via API |
| TGI | Copy to model directory | Safetensors with tokenizer files |
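For the Ollama path, installation amounts to generating a Modelfile that points at the distributed GGUF and running ollama create against it. A minimal sketch, assuming only the FROM and optional SYSTEM directives; the real generator also emits the auto-generated chat template noted in the table, and the function name here is hypothetical.

```python
def make_modelfile(gguf_path, system_prompt=None):
    """Render a minimal Ollama Modelfile for a distributed GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        lines.append(f'SYSTEM "{system_prompt}"')
    return "\n".join(lines) + "\n"
```

The worker would then write this to disk and run, for example, ollama create my-finetuned-llama -f Modelfile.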

NFS Path Optimization

When inference workers have NFS mounts from the Data Worker, the distribution system detects the shared filesystem and configures backends to read directly from the NFS path. This avoids redundant file copies and saves significant disk space and transfer time. Configure NFS mounts via the Data Worker NFS integration.
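The zero-copy decision can be sketched as a pure function: if a worker mounts the Data Worker's model export, reference the model by path; otherwise fall back to copying. The mount-table shape and the /srv/models export path are assumptions for illustration only; the real detection logic is internal to the distribution system.

```python
def transfer_plan(model_path, worker_mounts, registry_export="/srv/models"):
    """Return ('path', nfs_path) when the share is mounted, else ('copy', None).

    `worker_mounts` maps a worker-local mount point to the export it mounts.
    """
    for mount_point, export in worker_mounts.items():
        if export == registry_export and model_path.startswith(registry_export):
            relative = model_path[len(registry_export):].lstrip("/")
            return ("path", f"{mount_point}/{relative}")
    return ("copy", None)
```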

Dashboard

The Models tab in the controller dashboard at http://controller:8880/dashboard provides a visual interface for all model management operations.

Download Model to Cluster

Text input with a "Pull to All Workers" button. Enter any Ollama model name and pull it to every inference worker with one click. Shows real-time progress bars per worker.

Pull Job Progress

Live view of active pull jobs with per-worker status indicators: pending, pulling (with percentage), completed, or failed. Historical jobs remain visible for reference.

Cluster Models View

Shows every model in the cluster, which workers have it, and coverage percentage. Quickly identify models that are only on some workers and distribute them with one click.
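The coverage figure can be computed from a per-worker model inventory. A minimal sketch, assuming a mapping of worker ID to local model names; the helper name is hypothetical.

```python
def coverage(model, worker_models):
    """Percent of workers (by id -> list of model names) that have `model`."""
    if not worker_models:
        return 0.0
    have = sum(1 for models in worker_models.values() if model in models)
    return round(100.0 * have / len(worker_models), 1)
```

A model below 100% coverage is a candidate for the one-click distribute action described above.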

Custom Model Registry

Browse uploaded and trained models. View metadata including format, size, source, and creation date. Distribute or delete models directly from the registry view.

API Available Models

Aggregated list of all models available via the cluster API. This is the unified view that clients see when they call GET /api/v1/models.

Per-Worker Model Details

Drill down into any worker to see its local model list, sizes, modification dates, and backend type. Useful for debugging model availability issues.

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/models/pull | Pull a model from Ollama to workers (async job) |
| GET | /api/v1/models/pull-jobs | List all pull jobs (active and completed) |
| GET | /api/v1/models/pull-jobs/{id} | Get status and per-worker progress for a pull job |
| GET | /api/v1/model-registry | List all models in the custom registry |
| POST | /api/v1/model-registry/upload | Register a new model in the custom registry |
| DELETE | /api/v1/model-registry/{id} | Delete a model from the custom registry |
| POST | /api/v1/models/distribute | Distribute a model to workers from any source |
| GET | /api/v1/models | List all models aggregated across the cluster |
| POST | /api/v1/models/show | Get model details (template, system prompt, parameters) |

Worker Type Targeting

Model pulls and distribution only target workers that actually serve inference. Non-inference workers are automatically skipped to avoid wasting bandwidth and storage.

| Worker Type | Port | Receives Models | Reason |
|---|---|---|---|
| Inference Worker | 8890 | Yes | Primary inference endpoint (Ollama, vLLM, llama.cpp, etc.) |
| Cloud Worker | 8889 | Yes | Cloud inference gateway with local model caching |
| Data Worker | 8892 | No | Hosts registry only; does not serve inference |
| Science Worker | 8897 | No | Scientific APIs; uses inference workers for LLM tasks |
| Training Worker | 8898 | No | Pulls base models independently for training |
| Media Worker | 8894 | No | Audio/video processing; uses separate STT/TTS models |
| Comm Worker | 8895 | No | Messaging protocols; uses inference workers for AI replies |
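The targeting rule in the table reduces to a simple filter over the worker list. A sketch assuming worker records carry a type field; the field name and type strings are illustrative, not the controller's actual schema.

```python
# Only inference-serving worker types receive model pulls and distributions.
MODEL_TARGET_TYPES = {"inference", "cloud"}

def pull_targets(workers):
    """Filter a worker list down to the IDs that should receive models."""
    return [w["id"] for w in workers if w["type"] in MODEL_TARGET_TYPES]
```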

Quick Start

Get a model running across your entire cluster in three steps.

Step 1: Pull a Model

Initiate a cluster-wide pull from the Ollama registry.

```shell
curl -X POST http://controller:8880/api/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "all": true}'

# Note the job_id in the response
```

Step 2: Monitor Progress

Check the pull job until all workers report completion.

```shell
curl http://controller:8880/api/v1/models/pull-jobs/pull-a1b2c3d4

# Wait until all workers show "status": "completed"
```

Step 3: Verify Coverage

Confirm the model is available across the cluster.

```shell
# List all models and check that qwen3:8b appears with 100% coverage
curl http://controller:8880/api/v1/models

# Or get model details
curl -X POST http://controller:8880/api/v1/models/show \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b"}'
```

Related Components

| Component | Port | Protocol | Role in Model Management |
|---|---|---|---|
| Controller | 8880 | HTTP/REST | Pull orchestration, job tracking, distribution API |
| Inference Worker | 8890 | HTTP/REST | Model pull target, serves inference requests |
| Cloud Worker | 8889 | HTTP/REST | Cloud inference with model caching |
| Data Worker | 8892 | HTTP/REST + NFS | Custom model registry, NFS storage for models |
| Training Worker | 8898 | HTTP/REST | Produces fine-tuned models for the registry |