ollama-herd

Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral, plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama local AI (macOS, Linux, Windows) with 7-signal scoring, Ollama queue management, a real-time dashboard, and Ollama health monitoring. Routes Ollama LLM, image, STT, and embedding requests across macOS, Linux, and Windows devices. Ollama local inference routing | Ollama local AI router. Use when the user asks about their Ollama fleet, Ollama inference routing, Ollama node status, or Ollama fleet performance.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy the following and send it to your AI assistant to learn this skill

Install skill "ollama-herd" with this command: npx skills add twinsgeeks/ollama-herd

Ollama Herd Fleet Manager

You are managing an Ollama Herd fleet — a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.

Install Ollama Herd

pip install ollama-herd          # install Ollama Herd from PyPI
herd                             # start the Ollama router
herd-node                        # start an Ollama node agent (run on each device)

PyPI: ollama-herd | Source: github.com/geeks-accelerator/ollama-herd

Ollama Router endpoint

The Ollama Herd router runs at http://localhost:11435 by default. If the user has specified a different Ollama URL, use that instead.
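
If the router lives on another machine or port, one convenient pattern is to keep the base URL in a shell variable and reuse it in the commands that follow (a minimal sketch; every example below assumes the default http://localhost:11435):

# keep the router base URL in one place; override HERD_URL for a remote router
HERD_URL="${HERD_URL:-http://localhost:11435}"
curl -s "$HERD_URL/fleet/status" | python3 -m json.tool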

Ollama API endpoints

Use curl to interact with the Ollama fleet:

Ollama fleet status — overview of all Ollama nodes and queues

# ollama_fleet_status — check Ollama node health
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

Returns:

  • fleet.nodes_total / fleet.nodes_online — how many Ollama devices are in the fleet
  • fleet.models_loaded — total Ollama models currently loaded across all nodes
  • fleet.requests_active — total in-flight Ollama requests
  • nodes[] — per-node details: Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths
  • queues — per Ollama node:model queue depths (pending, in-flight, done, failed)
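
To reduce that payload to a one-line summary, a minimal sketch using the field names listed above (adjust if your version nests them differently):

# one-line fleet summary from /fleet/status
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
f = json.load(sys.stdin).get("fleet", {})
print(f.get("nodes_online"), "of", f.get("nodes_total"), "nodes online;",
      f.get("models_loaded"), "models loaded;",
      f.get("requests_active"), "requests in flight")
'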

List all Ollama models available across the fleet

# ollama_model_list — all Ollama models on all nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool

Pull an Ollama model onto the fleet

# ollama_pull_model — pull a model (auto-selects best node, streams progress)
curl -N http://localhost:11435/api/pull -d '{"name": "codestral"}'

# pull to a specific node
curl -N http://localhost:11435/api/pull -d '{"name": "llama3.3:70b", "node_id": "mac-studio"}'

# non-streaming (blocks until complete)
curl http://localhost:11435/api/pull -d '{"name": "phi4", "stream": false}'

List Ollama models currently loaded in memory

# ollama_loaded_models — hot Ollama models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

OpenAI-compatible Ollama model list

curl -s http://localhost:11435/v1/models | python3 -m json.tool

Ollama usage statistics (per-node, per-model daily aggregates)

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Recent Ollama request traces

# ollama_traces — recent Ollama routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.
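
For a compact view, a sketch that prints one line per trace; the JSON key names used here (model, node_id, latency_ms, status) are assumptions borrowed from the request_traces table queried later in this document, so inspect the raw payload first if they differ:

# one line per recent routing decision (key names are assumptions)
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -c '
import json, sys
data = json.load(sys.stdin)
traces = data if isinstance(data, list) else data.get("traces", [])
for t in traces:
    print(t.get("model"), "->", t.get("node_id"), "|", t.get("latency_ms"), "ms |", t.get("status"))
'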

Ollama fleet health analysis

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMA_NUM_PARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.

Ollama model recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.

Ollama settings

# View current Ollama config and node versions
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle Ollama runtime settings (auto_pull, vram_fallback)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'

Ollama model management

# View per-node Ollama model details with sizes and usage
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull an Ollama model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Delete an Ollama model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Ollama model insights (summary statistics)

curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool

Per-app Ollama analytics (requires request tagging)

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Ollama Dashboard

The Ollama web dashboard is at http://localhost:11435/dashboard. It has eight tabs:

  • Fleet Overview — live Ollama node cards, queue depths, and request counts via SSE
  • Trends — Ollama requests per hour, average latency, and token throughput charts (24h–7d)
  • Model Insights — per-Ollama-model latency, tokens/sec, usage comparison
  • Apps — per-tag Ollama analytics with request volume, latency, tokens, error rates
  • Benchmarks — Ollama capacity growth over time with per-run throughput and latency percentiles
  • Health — 15 automated Ollama fleet health checks with severity levels
  • Recommendations — Ollama model mix recommendations per node with one-click pull
  • Settings — Ollama runtime toggle switches, read-only config tables, and node version tracking

Direct the user to open this URL in their browser for visual Ollama monitoring.

Ollama Resilience features

  • Auto-retry — if an Ollama node fails before the first response chunk, re-scores and retries on the next-best Ollama node (up to 2 retries)
  • Ollama model fallbacks — clients specify backup Ollama models; tries alternatives when the primary is unavailable
  • Context protection — strips num_ctx from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model
  • VRAM-aware fallback — routes to an already-loaded Ollama model in the same category instead of cold-loading
  • Zombie reaper — background task detects and cleans up stuck in-flight Ollama requests
  • Auto-pull — automatically pulls missing Ollama models onto the best available node

Common Ollama tasks

Check if the Ollama fleet is healthy

  1. Hit /fleet/status and verify nodes_online > 0 (see the sketch after this list)
  2. Hit /dashboard/api/health for automated Ollama health checks with severity levels
  3. Look at Ollama queue depths — deep queues may indicate a bottleneck
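
A minimal script for step 1, using the fleet.nodes_online field described under /fleet/status above:

# prints the online count and exits non-zero if no nodes are online
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
online = json.load(sys.stdin).get("fleet", {}).get("nodes_online", 0)
print("nodes online:", online)
sys.exit(0 if int(online or 0) > 0 else 1)
'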

Find which Ollama node has a specific model

  1. Hit /fleet/status and inspect each Ollama node's ollama.models_loaded and ollama.models_available (see the sketch after this list)
  2. Or hit /api/tags for a flat list of all available Ollama models with which nodes have them
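
A sketch of step 1, assuming each node entry exposes a node_id and that models_available / models_loaded entries contain the model name; those shapes are assumptions, so fall back to reading the raw JSON if they differ:

# list nodes that report a given model (llama3.3:70b is only an example)
MODEL="llama3.3:70b"
curl -s http://localhost:11435/fleet/status | python3 -c '
import json, sys
model = sys.argv[1]
for node in json.load(sys.stdin).get("nodes", []):
    o = node.get("ollama", {})
    entries = list(o.get("models_available", [])) + list(o.get("models_loaded", []))
    if any(model in str(e) for e in entries):
        print(node.get("node_id", "<unnamed node>"))
' "$MODEL"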

Check if an Ollama model is loaded (hot) or cold

  1. Hit /api/ps — Ollama models listed here are currently loaded in memory (hot)
  2. Models in /api/tags but not in /api/ps are on disk but not loaded (cold); a quick way to diff the two lists follows below
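
A minimal sketch of that diff, assuming the router mirrors upstream Ollama's response shape (a top-level models array whose entries carry a name field):

# models on disk (in /api/tags) but not loaded in memory (not in /api/ps)
python3 -c '
import json, urllib.request
def names(path):
    with urllib.request.urlopen("http://localhost:11435" + path) as r:
        return {m.get("name") for m in json.load(r).get("models", [])}
cold = names("/api/tags") - names("/api/ps")
print("\n".join(sorted(n for n in cold if n)) or "all available models are currently loaded")
'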

View recent Ollama inference activity

  1. Hit /dashboard/api/traces?limit=10 to see the last 10 Ollama requests
  2. Each trace shows: Ollama model, node, score, latency, tokens, retry/fallback status

Diagnose slow Ollama responses

  1. Check /dashboard/api/traces for high latency Ollama entries
  2. Check /fleet/status for Ollama nodes with high queue depths or memory pressure
  3. Check if the Ollama model had to cold-load (look for low scores in trace)
  4. Check if num_ctx is being sent — Ollama context protection logs show if requests triggered reloads

Query the Ollama trace database directly

# Recent Ollama failures
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"

# Slowest Ollama requests
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, latency_ms/1000.0 as secs FROM request_traces WHERE status='completed' ORDER BY latency_ms DESC LIMIT 10"

Test Ollama inference through the fleet

# Ollama via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'

# Ollama native format
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'

Ollama Guardrails

  • Never restart or stop the Ollama Herd router or Ollama node agents without explicit user confirmation.
  • Never delete or modify files in ~/.fleet-manager/ (contains Ollama latency data, traces, and logs).
  • Do not pull Ollama models onto nodes without user confirmation — Ollama model downloads can be large (10-100+ GB).
  • Do not delete Ollama models without user confirmation.
  • If an Ollama node shows as offline, report it to the user rather than attempting to SSH into the machine.

Ollama Failure handling

  • If curl to the Ollama router fails with connection refused, tell the user the Ollama Herd router may not be running and suggest running herd to start it (a quick reachability probe follows this list).
  • If the Ollama fleet status shows 0 nodes online, suggest starting Ollama node agents with herd-node on their devices.
  • If Ollama mDNS discovery fails, suggest using --router-url http://router-ip:11435 for explicit connection.
  • If Ollama requests hang with 0 bytes returned, check if the client is sending num_ctx — Ollama context protection should strip it.
  • If a specific Ollama API endpoint returns an error, show the user the full error response and suggest checking the Ollama JSONL logs at ~/.fleet-manager/logs/herd.jsonl.
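
A quick reachability probe for the first case, using only standard curl flags:

# says "router unreachable" if nothing answers within 3 seconds
curl -sf -o /dev/null --max-time 3 http://localhost:11435/fleet/status \
  && echo "router reachable" \
  || echo "router unreachable; suggest starting it with: herd"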

