ollama-herd

Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral, plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama local AI (macOS, Linux, Windows) with 7-signal scoring, Ollama queue management, a real-time dashboard, and Ollama health monitoring. Routes Ollama LLM, image, STT, and embedding requests across macOS, Linux, and Windows devices. Ollama local inference routing | Ollama local AI router. Use when the user asks about their Ollama fleet, Ollama inference routing, Ollama node status, or Ollama fleet performance.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy the following and send it to your AI assistant to learn this skill

Install skill "ollama-herd" with this command: npx skills add twinsgeeks/ollama-herd

Ollama Herd Fleet Manager

You are managing an Ollama Herd fleet — a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.

Install Ollama Herd

pip install ollama-herd          # install Ollama Herd from PyPI
herd                             # start the Ollama router
herd-node                        # start an Ollama node agent (run on each device)

PyPI: ollama-herd | Source: github.com/geeks-accelerator/ollama-herd

Ollama Router endpoint

The Ollama Herd router runs at http://localhost:11435 by default. If the user has specified a different Ollama URL, use that instead.
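
If the router lives on another machine or port, one convenient pattern is to keep the base URL in a shell variable and reuse it in the commands that follow (a minimal sketch; every example below assumes the default http://localhost:11435):

# keep the router base URL in one place; override HERD_URL for a remote router
HERD_URL="${HERD_URL:-http://localhost:11435}"
curl -s "$HERD_URL/fleet/status" | python3 -m json.tool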

Ollama API endpoints

Use curl to interact with the Ollama fleet:

Ollama fleet status — overview of all Ollama nodes and queues

# ollama_fleet_status — check Ollama node health
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

Returns:

  • fleet.nodes_total / fleet.nodes_online — how many Ollama devices are in the fleet
  • fleet.models_loaded — total Ollama models currently loaded across all nodes
  • fleet.requests_active — total in-flight Ollama requests
  • nodes[] — per-node details: Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths
  • queues — per Ollama node:model queue depths (pending, in-flight, done, failed)
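
To reduce that payload to a one-line summary, a minimal sketch using the field names listed above (adjust if your version nests them differently):

# one-line fleet summary from /fleet/status
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
f = json.load(sys.stdin).get("fleet", {})
print(f.get("nodes_online"), "of", f.get("nodes_total"), "nodes online;",
      f.get("models_loaded"), "models loaded;",
      f.get("requests_active"), "requests in flight")
'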

List all Ollama models available across the fleet

# ollama_model_list — all Ollama models on all nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool

Pull an Ollama model onto the fleet

# ollama_pull_model — pull a model (auto-selects best node, streams progress)
curl -N http://localhost:11435/api/pull -d '{"name": "codestral"}'

# pull to a specific node
curl -N http://localhost:11435/api/pull -d '{"name": "llama3.3:70b", "node_id": "mac-studio"}'

# non-streaming (blocks until complete)
curl http://localhost:11435/api/pull -d '{"name": "phi4", "stream": false}'

List Ollama models currently loaded in memory

# ollama_loaded_models — hot Ollama models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

OpenAI-compatible Ollama model list

curl -s http://localhost:11435/v1/models | python3 -m json.tool

Ollama usage statistics (per-node, per-model daily aggregates)

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Recent Ollama request traces

# ollama_traces — recent Ollama routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.
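
For a compact view, a sketch that prints one line per trace; the JSON key names used here (model, node_id, latency_ms, status) are assumptions borrowed from the request_traces table queried later in this document, so inspect the raw payload first if they differ:

# one line per recent routing decision (key names are assumptions)
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -c '
import json, sys
data = json.load(sys.stdin)
traces = data if isinstance(data, list) else data.get("traces", [])
for t in traces:
    print(t.get("model"), "->", t.get("node_id"), "|", t.get("latency_ms"), "ms |", t.get("status"))
'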

Ollama fleet health analysis

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMA_NUM_PARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.

Ollama model recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.

Ollama settings

# View current Ollama config and node versions
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle Ollama runtime settings (auto_pull, vram_fallback)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'

Ollama model management

# View per-node Ollama model details with sizes and usage
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull an Ollama model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Delete an Ollama model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Ollama model insights (summary statistics)

curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool

Per-app Ollama analytics (requires request tagging)

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Ollama Dashboard

The Ollama web dashboard is at http://localhost:11435/dashboard. It has eight tabs:

  • Fleet Overview — live Ollama node cards, queue depths, and request counts via SSE
  • Trends — Ollama requests per hour, average latency, and token throughput charts (24h–7d)
  • Model Insights — per-Ollama-model latency, tokens/sec, usage comparison
  • Apps — per-tag Ollama analytics with request volume, latency, tokens, error rates
  • Benchmarks — Ollama capacity growth over time with per-run throughput and latency percentiles
  • Health — 15 automated Ollama fleet health checks with severity levels
  • Recommendations — Ollama model mix recommendations per node with one-click pull
  • Settings — Ollama runtime toggle switches, read-only config tables, and node version tracking

Direct the user to open this URL in their browser for visual Ollama monitoring.

Ollama Resilience features

  • Auto-retry — if an Ollama node fails before the first response chunk, re-scores and retries on the next-best Ollama node (up to 2 retries)
  • Ollama model fallbacks — clients specify backup Ollama models; tries alternatives when the primary is unavailable
  • Context protection — strips num_ctx from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model
  • VRAM-aware fallback — routes to an already-loaded Ollama model in the same category instead of cold-loading
  • Zombie reaper — background task detects and cleans up stuck in-flight Ollama requests
  • Auto-pull — automatically pulls missing Ollama models onto the best available node

Common Ollama tasks

Check if the Ollama fleet is healthy

  1. Hit /fleet/status and verify nodes_online > 0 (see the sketch after this list)
  2. Hit /dashboard/api/health for automated Ollama health checks with severity levels
  3. Look at Ollama queue depths — deep queues may indicate a bottleneck
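
A minimal script for step 1, using the fleet.nodes_online field described under /fleet/status above:

# prints the online count and exits non-zero if no nodes are online
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
online = json.load(sys.stdin).get("fleet", {}).get("nodes_online", 0)
print("nodes online:", online)
sys.exit(0 if int(online or 0) > 0 else 1)
'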

Find which Ollama node has a specific model

  1. Hit /fleet/status and inspect each Ollama node's ollama.models_loaded and ollama.models_available (see the sketch after this list)
  2. Or hit /api/tags for a flat list of all available Ollama models with which nodes have them
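
A sketch of step 1, assuming each node entry exposes a node_id and that models_available / models_loaded entries contain the model name; those shapes are assumptions, so fall back to reading the raw JSON if they differ:

# list nodes that report a given model (llama3.3:70b is only an example)
MODEL="llama3.3:70b"
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
model = sys.argv[1]
for node in json.load(sys.stdin).get("nodes", []):
    o = node.get("ollama", {})
    entries = list(o.get("models_available", [])) + list(o.get("models_loaded", []))
    if any(model in str(e) for e in entries):
        print(node.get("node_id", "<unnamed node>"))
' "$MODEL"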

Check if an Ollama model is loaded (hot) or cold

  1. Hit /api/ps — Ollama models listed here are currently loaded in memory (hot)
  2. Models in /api/tags but not in /api/ps are on disk but not loaded (cold); a quick way to diff the two lists follows below
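
A minimal sketch of that diff, assuming the router mirrors upstream Ollama's response shape (a top-level models array whose entries carry a name field):

# models on disk (in /api/tags) but not loaded in memory (not in /api/ps)
python3 -c '
import json, urllib.request
def names(path):
    with urllib.request.urlopen("http://localhost:11435" + path) as r:
        return {m.get("name") for m in json.load(r).get("models", [])}
cold = names("/api/tags") - names("/api/ps")
print("\n".join(sorted(n for n in cold if n)) or "all available models are currently loaded")
'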

View recent Ollama inference activity

  1. Hit /dashboard/api/traces?limit=10 to see the last 10 Ollama requests
  2. Each trace shows: Ollama model, node, score, latency, tokens, retry/fallback status

Diagnose slow Ollama responses

  1. Check /dashboard/api/traces for high latency Ollama entries
  2. Check /fleet/status for Ollama nodes with high queue depths or memory pressure
  3. Check if the Ollama model had to cold-load (look for low scores in trace)
  4. Check if num_ctx is being sent — Ollama context protection logs show if requests triggered reloads

Query the Ollama trace database directly

# Recent Ollama failures
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"

# Slowest Ollama requests
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, latency_ms/1000.0 as secs FROM request_traces WHERE status='completed' ORDER BY latency_ms DESC LIMIT 10"

Test Ollama inference through the fleet

# Ollama via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'

# Ollama native format
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'

Ollama Guardrails

  • Never restart or stop the Ollama Herd router or Ollama node agents without explicit user confirmation.
  • Never delete or modify files in ~/.fleet-manager/ (contains Ollama latency data, traces, and logs).
  • Do not pull Ollama models onto nodes without user confirmation — Ollama model downloads can be large (10-100+ GB).
  • Do not delete Ollama models without user confirmation.
  • If an Ollama node shows as offline, report it to the user rather than attempting to SSH into the machine.

Ollama Failure handling

  • If curl to the Ollama router fails with connection refused, tell the user the Ollama Herd router may not be running and suggest running herd to start it (a quick reachability probe follows this list).
  • If the Ollama fleet status shows 0 nodes online, suggest starting Ollama node agents with herd-node on their devices.
  • If Ollama mDNS discovery fails, suggest using --router-url http://router-ip:11435 for explicit connection.
  • If Ollama requests hang with 0 bytes returned, check if the client is sending num_ctx — Ollama context protection should strip it.
  • If a specific Ollama API endpoint returns an error, show the user the full error response and suggest checking the Ollama JSONL logs at ~/.fleet-manager/logs/herd.jsonl.
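
A quick reachability probe for the first case, using only standard curl flags:

# says "router unreachable" if nothing answers within 3 seconds
curl -sf -o /dev/null --max-time 3 http://localhost:11435/fleet/status \
  && echo "router reachable" \
  || echo "router unreachable; suggest starting it with: herd"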

