ollama-ollama-herd

Ollama Herd — multimodal Ollama model router that herds your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, and Mistral across macOS, Linux, and Windows devices. Self-hosted local AI with 7-signal scoring, auto-retry, and VRAM-aware fallback. Plus image generation, speech-to-text, and embeddings. Drop-in OpenAI SDK compatible. Ollama local inference routing | Ollama local AI router.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant:

Install skill "ollama-ollama-herd" with this command: npx skills add twinsgeeks/ollama-ollama-herd

Ollama — Herd Your Ollama LLMs Into One Endpoint

You have Ollama running on multiple machines. This skill gives you one endpoint that automatically routes every request to the best available device. No more hardcoded IPs, no more manual load balancing, no more "which machine has that model loaded?"

Setup Ollama Herd

pip install ollama-herd          # install the Ollama router
herd                             # start the Ollama router on port 11435
herd-node                        # run on each machine with Ollama installed

Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same models, smarter routing.

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
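
The official ollama Python package works against the herd port unchanged. A minimal sketch, assuming a recent ollama package (0.4+, pip install ollama):

# ollama_herd_client — talk to the fleet with the official ollama package
from ollama import Client

herd = Client(host="http://localhost:11435")  # the herd router, not 11434

# List every model available anywhere in the fleet
for m in herd.list().models:
    print(m.model)

# Chat exactly as you would against a single Ollama instance
reply = herd.chat(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello from the herd"}],
)
print(reply.message.content)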

Use your Ollama models through the fleet

OpenAI SDK (drop-in Ollama routing)

# ollama_openai_client — route Ollama requests via OpenAI SDK
from openai import OpenAI

ollama_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
ollama_response = ollama_client.chat.completions.create(
    model="llama3.3:70b",  # any Ollama model
    messages=[{"role": "user", "content": "Hello from Ollama"}],
    stream=True,
)
for chunk in ollama_response:
    print(chunk.choices[0].delta.content or "", end="")
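
The same client can fetch embeddings through the /v1 layer. A hedged sketch, assuming the herd mirrors stock Ollama's OpenAI-compatible /v1/embeddings endpoint:

# ollama_openai_embeddings — embeddings via the same OpenAI client
emb = ollama_client.embeddings.create(
    model="nomic-embed-text",
    input="search query routed through the fleet",
)
print(len(emb.data[0].embedding))  # vector length, e.g. 768 for nomic-embed-text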

Ollama API (same as before, different port)

# Ollama chat — routed through the Ollama fleet
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3:235b",
  "messages": [{"role": "user", "content": "Hello via Ollama Herd"}],
  "stream": false
}'

# List all Ollama models across all machines
curl http://localhost:11435/api/tags

# Ollama models currently in GPU memory
curl http://localhost:11435/api/ps

# Ollama embeddings
curl http://localhost:11435/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Ollama embedding search query"
}'
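
From Python, the same endpoints are a standard-library call away. A small sketch that polls /api/ps to list hot models, assuming the response keeps stock Ollama's shape (a models array with name and size_vram fields):

# ollama_ps_poll — which models are hot, and where
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11435/api/ps") as resp:
    ps = json.load(resp)

for m in ps.get("models", []):
    vram_mib = m.get("size_vram", 0) // (1024 ** 2)
    print(f'{m["name"]}: {vram_mib} MiB in VRAM')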

What the Ollama router does

When a request comes in, the router scores every online node on 7 signals:

  1. Thermal — is the model already loaded in GPU memory? (+50 for hot)
  2. Memory fit — how much VRAM headroom does the node have?
  3. Queue depth — how many requests are already waiting?
  4. Wait time — estimated latency based on request history
  5. Role affinity — large models prefer big machines
  6. Availability — is the node reliably reachable?
  7. Context fit — does the loaded context window fit the request?

The highest-scoring node handles the request. If it fails, the router automatically retries on the next best node.
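
To make the scoring concrete, here is an illustrative sketch. The weights and Node fields are invented for the example; only the seven signals above come from the skill itself:

# herd_score_sketch — hypothetical 7-signal scoring (weights are made up)
from dataclasses import dataclass

@dataclass
class Node:
    model_loaded: bool     # 1. thermal: model already in GPU memory
    free_vram_gb: float    # 2. memory fit
    queue_depth: int       # 3. queue depth
    avg_wait_s: float      # 4. wait time (historical)
    role_bonus: float      # 5. role affinity (big models prefer big machines)
    availability: float    # 6. reliability, 0.0-1.0
    ctx_fits: bool         # 7. loaded context window fits the request

def score(node: Node, model_vram_gb: float) -> float:
    s = 50.0 if node.model_loaded else 0.0             # hot model: +50
    s += min(node.free_vram_gb - model_vram_gb, 10.0)  # headroom, capped
    s -= 5.0 * node.queue_depth                        # waiting requests hurt
    s -= node.avg_wait_s                               # so does expected latency
    s += node.role_bonus + 10.0 * node.availability    # affinity and reliability
    s += 5.0 if node.ctx_fits else -20.0               # avoid context reloads
    return s

nodes = [
    Node(True, 8.0, 2, 1.5, 5.0, 0.99, True),     # hot but busy
    Node(False, 48.0, 0, 0.2, 10.0, 0.95, True),  # cold but idle
]
best = max(nodes, key=lambda n: score(n, model_vram_gb=42.0))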

Supported Ollama models

Any model that runs on Ollama works through the fleet. Popular choices:

Model              Sizes              Best for
llama3.3           8B, 70B            General-purpose inference
qwen3              0.6B–235B          Multilingual reasoning
qwen3.5            0.8B–397B          Latest-generation model
deepseek-v3        671B (37B active)  GPT-4o alternative
deepseek-r1        1.5B–671B          Reasoning (like o3)
phi4               14B                Small, fast model
mistral            7B                 Fast, strong on European languages
gemma3             1B–27B             Google's open model
codestral          22B                Code generation
qwen3-coder        30B (3.3B active)  Agentic coding
nomic-embed-text   137M               Embeddings for RAG

Resilience features

  • Auto-retry — re-routes to the next best node on failure, before the first chunk is sent (sketched below)
  • VRAM-aware fallback — routes to an already-loaded model in the same category instead of cold-loading
  • Context protection — prevents num_ctx changes from triggering expensive model reloads
  • Zombie reaper — cleans up stuck in-flight requests
  • Auto-pull — downloads missing models to the best node automatically
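
A hypothetical sketch of the auto-retry rule: try nodes in score order and fall through on failure, but only until the first streamed chunk goes out, after which the response is committed. The node.send call is an invented stand-in, not the real agent API:

# herd_retry_sketch — retry before the first chunk, never after
from itertools import chain

def route_with_retry(ranked_nodes, request):
    last_err = None
    for node in ranked_nodes:
        try:
            stream = node.send(request)    # hypothetical per-node call
            first = next(stream)           # a failure here is still retryable
            return chain([first], stream)  # first chunk out: no more retries
        except Exception as err:
            last_err = err                 # move on to the next best node
    raise last_err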

Also available via Ollama Herd

The same fleet router handles three more workloads:

Image generation

curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset via Ollama Herd","width":1024,"height":1024,"steps":4}'

Speech-to-text

curl http://localhost:11435/api/transcribe -F "audio=@recording.wav"
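
The Python equivalent of the transcription call, again assuming requests; the multipart field name audio comes from the curl example:

# herd_transcribe — upload a WAV for speech-to-text
import requests

with open("recording.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:11435/api/transcribe",
        files={"audio": f},
    )
print(resp.text)  # transcript format depends on the STT backend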

Embeddings

curl http://localhost:11435/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Ollama embedding text"}'

Dashboard

http://localhost:11435/dashboard — 8 tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. Real-time queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.

Request tagging

Track per-project usage:

ollama_response = ollama_client.chat.completions.create(
    model="llama3.3:70b",  # Ollama model
    messages=messages,
    extra_body={"metadata": {"tags": ["my-ollama-project", "reasoning"]}},
)
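
Tags travel in the request body, so per-project usage can be attributed afterward, presumably via the Apps tab on the dashboard described above.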

Full Ollama documentation

Ollama Agent Setup Guide

Guardrails

  • Never restart the router or node agents without user confirmation.
  • Never delete or modify files in ~/.fleet-manager/ (herd data).
  • Never pull or delete Ollama models without user confirmation.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Ollama Herd (General)

Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral — plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama...

Local Llm Router (Coding)

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a...

Ollama — Herd Your LLMs Into One Smart Endpoint (Coding)

Ollama fleet router — herd your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices with 7-sign...

Apple Silicon Ai (Coding)

Apple Silicon AI — run LLMs, image generation, speech-to-text, and embeddings on Mac Studio, Mac Mini, MacBook Pro, and Mac Pro. Turn your Apple Silicon devi...
