ollama-load-balancer

Ollama load balancer for Llama, Qwen, DeepSeek, and Mistral inference across multiple machines. Load balancing with auto-discovery via mDNS, health checks, queue management, automatic failover, retry on node failure, and zombie request cleanup. Zero configuration.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant so it can learn the skill

Install skill "ollama-load-balancer" with this command: npx skills add twinsgeeks/ollama-load-balancer

Ollama Load Balancer

You are managing an Ollama load balancer that distributes inference requests across multiple Ollama instances with automatic discovery, health monitoring, and failover. The load balancer handles all routing decisions transparently.

What the load balancer solves

Ollama has no built-in load balancing. One machine goes down, your app gets errors. No health checks, no failover, no queue management. You're manually pointing clients at specific machines and hoping they stay up.

This load balancer auto-discovers Ollama instances via mDNS, monitors their health continuously, and distributes load based on real-time scoring. The load balancer automatically retries on failure. Zero config files. Zero Docker. pip install ollama-herd, run two commands, and load balancing is active.

Deploy the load balancer

pip install ollama-herd
herd              # start the load balancer on port 11435
herd-node         # start load balancer backend node on each machine

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Load Balancer Endpoint

The load balancer runs at http://localhost:11435. Drop-in replacement for direct Ollama connections — same API, same model names, with load balancing built in.

from openai import OpenAI
# Load balancer client — requests are balanced across all backend nodes
load_balancer_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
load_balanced_response = load_balancer_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain load balancing for LLM inference"}]
)
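
Streaming should pass through the same way. As a hedged sketch, reusing load_balancer_client from above and assuming the balancer proxies the standard OpenAI streaming interface unchanged:

# Streaming sketch: assumes the balancer forwards OpenAI-style
# streaming chunks unchanged
stream = load_balancer_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain load balancing for LLM inference"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)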

Load Balancer Health Monitoring

Fleet-wide load balancer health check (15 automated checks)

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

The health endpoint runs 15 automated checks covering offline nodes, degraded nodes, memory pressure, underutilized nodes, model thrashing, request timeouts, and error rates. Each check returns a severity (info/warning/critical) and a recommendation.
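
To surface only the serious findings in a script, a minimal sketch. The exact payload shape (a checks list with name, severity, and recommendation fields) is an assumption inferred from the description above:

import json, urllib.request

# Health triage sketch. The "checks"/"name"/"severity"/"recommendation"
# keys are assumptions; adjust to the actual /dashboard/api/health payload.
with urllib.request.urlopen("http://localhost:11435/dashboard/api/health") as resp:
    health = json.load(resp)

for check in health.get("checks", []):
    if check.get("severity") in ("warning", "critical"):
        print(f"[{check['severity'].upper()}] {check.get('name')}: {check.get('recommendation')}")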

Load balancer node status and metrics

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

Returns per-node status (online/degraded/offline), CPU utilization, memory usage, loaded models with context lengths, and queue depths (pending/in-flight/done/failed).
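
To flag nodes that have dropped out of rotation, a small sketch. The top-level nodes key and per-node field names are assumptions inferred from the fields listed above:

import json, urllib.request

# Offline-node report sketch. The "nodes" key and "status" field are
# assumptions; adjust to the actual /fleet/status payload.
with urllib.request.urlopen("http://localhost:11435/fleet/status") as resp:
    fleet = json.load(resp)

for node_id, node in fleet.get("nodes", {}).items():
    if node.get("status") != "online":
        print(f"{node_id}: {node.get('status')}")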

Load balancer queue depths

curl -s http://localhost:11435/fleet/status | python3 -c "
import sys, json
# Load balancer queue inspection
data = json.load(sys.stdin)
for key, q in data.get('queues', {}).items():
    print(f\"{key}: {q['pending']} pending, {q['in_flight']}/{q['max_concurrent']} in-flight\")
"

Load Balancer Auto-Recovery

  • Auto-retry — if a node fails before the first response chunk, the balancer re-scores and retries on the next-best node (up to 2 retries, configurable via FLEET_MAX_RETRIES)
  • Zombie reaper — a background task detects in-flight requests stuck longer than 10 minutes and cleans them up
  • Context protection — strips dangerous num_ctx parameters that would trigger model reloads
  • VRAM-aware fallback — routes to an already-loaded model instead of triggering a cold load
  • Auto-pull — optionally pulls missing models (disabled by default, toggle via settings)
  • Holding queue — when all nodes are busy, requests wait (up to 30s) rather than failing; a client-side backoff sketch follows this list
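
The server-side retries cover node failures, but if every node stays saturated past the ~30s holding-queue window the request will error out. A hedged client-side backoff sketch (the exact error class your client sees is an assumption; openai's APIError is used here as a catch-all):

import time
from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# Backoff sketch: only for the case where the holding queue itself
# times out; per-node failures are already retried by the balancer.
def chat_with_backoff(messages, model="llama3.3:70b", attempts=3):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, between attempts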

Load Balancer API Endpoints

Models available through the load balancer

# All models across the load-balanced fleet
curl -s http://localhost:11435/api/tags | python3 -m json.tool

# Models currently loaded in load balancer backend memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

# OpenAI-compatible model list via load balancer
curl -s http://localhost:11435/v1/models | python3 -m json.tool
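
The same list is reachable from Python through the OpenAI-compatible endpoint shown above:

from openai import OpenAI

# Enumerate every model the fleet serves, via the load balancer
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)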

Load balancer request traces

curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

Load balancer usage statistics

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Load balancer model recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Load balancer settings (runtime toggles)

# View load balancer config
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle load balancer features
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'
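
The same toggle from Python, as a sketch: auto_pull comes from the example above; any other keys are whatever GET /dashboard/api/settings reports.

import json, urllib.request

# Toggle a setting (sketch); assumes the endpoint replies with JSON
req = urllib.request.Request(
    "http://localhost:11435/dashboard/api/settings",
    data=json.dumps({"auto_pull": False}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())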

Load balancer model management

# View per-node model details behind the load balancer
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull a model to a load balancer backend node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "load-balancer-node-1"}'

# Delete a model from a load balancer node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "load-balancer-node-1"}'

Load balancer per-app analytics

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Load Balancer Dashboard

Web dashboard at http://localhost:11435/dashboard with eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, and Settings. All data updates in real time via Server-Sent Events.

Load Balancer Operational Queries

Recent load balancer failures with error details

sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message, latency_ms/1000.0 as secs FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"

Load balancer retry frequency by node

sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, SUM(retry_count) as retries, COUNT(*) as total FROM request_traces GROUP BY node_id ORDER BY retries DESC"

Load balancer requests per hour

sqlite3 ~/.fleet-manager/latency.db "SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour, COUNT(*) as requests FROM request_traces GROUP BY hour ORDER BY hour"
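
For analysis beyond shell one-liners, the same database is queryable from Python. A sketch computing average latency per model, using only columns that appear in the queries above:

import sqlite3
from pathlib import Path

# Per-model latency summary. model/status/latency_ms are the columns
# used by the shell queries above; non-'failed' rows are assumed successful.
db = sqlite3.connect(Path.home() / ".fleet-manager" / "latency.db")
rows = db.execute(
    "SELECT model, COUNT(*) AS n, AVG(latency_ms) / 1000.0 AS avg_secs "
    "FROM request_traces WHERE status != 'failed' "
    "GROUP BY model ORDER BY n DESC"
).fetchall()
for model, n, avg_secs in rows:
    print(f"{model}: {n} requests, avg {avg_secs:.1f}s")
db.close()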

Test load balancer inference

curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Test load balancing across nodes"}],"stream":false}'

curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Verify load balancer routing"}],"stream":false}'

Load Balancer Guardrails

  • Never restart or stop the load balancer or node agents without explicit user confirmation.
  • Never delete or modify files in ~/.fleet-manager/ (contains load balancer latency data, traces, and logs).
  • Do not pull or delete models on load balancer nodes without user confirmation — downloads can be 10-100+ GB.
  • If a load balancer node shows as offline, report it rather than attempting to SSH into the machine.
  • If all load balancer nodes are saturated, suggest the user check the dashboard.

Load Balancer Failure Handling

  • Connection refused → the load balancer may not be running; start it with herd (or uv run herd)
  • 0 nodes online → start herd-node on each backend device (a preflight sketch covering these first two checks follows this list)
  • mDNS discovery fails → point nodes at the router explicitly with --router-url http://router-ip:11435
  • Requests hang → check for num_ctx in client requests; verify with grep "Context protection" ~/.fleet-manager/logs/herd.jsonl
  • API errors → check ~/.fleet-manager/logs/herd.jsonl
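
A quick preflight covering the first two failure modes above. The nodes key and status field are the same assumptions as in the earlier /fleet/status sketch:

import json, urllib.request, urllib.error

# Preflight sketch: is the balancer up, and does it see any online nodes?
try:
    with urllib.request.urlopen("http://localhost:11435/fleet/status", timeout=5) as resp:
        fleet = json.load(resp)
except urllib.error.URLError:
    print("Connection refused; is herd running on port 11435?")
else:
    online = [n for n in fleet.get("nodes", {}).values() if n.get("status") == "online"]
    if online:
        print(f"{len(online)} node(s) online")
    else:
        print("0 nodes online: start herd-node on each backend machine")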

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Ollama Herd

Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral — plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama...

General

Ollama Ollama Herd

Ollama Ollama Herd — multimodal Ollama model router that herds your Ollama LLMs into one smart Ollama endpoint. Route Ollama Llama, Qwen, DeepSeek, Phi, Mist...

Coding

Local Llm Router

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a...

Coding

Ollama — Herd Your LLMs Into One Smart Endpoint

Ollama fleet router — herd your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices with 7-sign...
