Self-Hosted AI — Own Your Entire AI Stack

Stop paying per token. Stop sending data to cloud APIs. Run self-hosted LLMs, self-hosted image generation, self-hosted speech-to-text, and self-hosted embeddings on your own hardware. One self-hosted router makes all your devices act like one system.

What self-hosted AI replaces

Cloud service	Self-hosted replacement	How
OpenAI API	Self-hosted Llama 3.3, Qwen 3.5, DeepSeek-R1 via Ollama	Same OpenAI SDK, swap the base URL
DALL-E / Midjourney	Self-hosted Stable Diffusion 3, Flux via mflux/DiffusionKit	`POST /api/generate-image`
Whisper API	Self-hosted Qwen3-ASR via MLX	`POST /api/transcribe`
OpenAI Embeddings	Self-hosted nomic-embed-text, mxbai-embed via Ollama	`POST /api/embed`

Same APIs. Same quality. Zero per-request costs. All data stays on your self-hosted machines.

Self-Hosted Setup

pip install ollama-herd    # Self-hosted AI router from PyPI
herd                       # start the self-hosted router
herd-node                  # run on each self-hosted machine — auto-discovers the router

No Docker. No Kubernetes. No config files. Self-hosted devices find each other automatically on your local network.

Self-Hosted LLM Inference

Drop-in self-hosted replacement for the OpenAI SDK:

from openai import OpenAI

# Self-hosted inference client — replaces OpenAI cloud
self_hosted_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

self_hosted_response = self_hosted_client.chat.completions.create(
    model="llama3.3:70b",  # self-hosted model, no cloud dependency
    messages=[{"role": "user", "content": "Analyze this contract for risks"}],
    stream=True,
)
for chunk in self_hosted_response:
    print(chunk.choices[0].delta.content or "", end="")

Self-hosted Ollama API

curl http://localhost:11435/api/chat -d '{
  "model": "deepseek-r1:70b",
  "messages": [{"role": "user", "content": "Explain self-hosted AI advantages over cloud APIs"}],
  "stream": false
}'

Self-Hosted Image Generation

Self-hosted replacement for DALL-E and Midjourney:

# Install self-hosted image backends on any node
uv tool install mflux           # Self-hosted Flux models (~7s)
uv tool install diffusionkit    # Self-hosted Stable Diffusion 3/3.5

# Generate on your self-hosted fleet
curl -o self_hosted_output.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model": "z-image-turbo", "prompt": "self-hosted AI generating product mockup", "width": 1024, "height": 1024}'

Self-Hosted Speech-to-Text

Self-hosted replacement for Whisper API:

curl http://localhost:11435/api/transcribe \
  -F "file=@self_hosted_meeting.wav" \
  -F "model=qwen3-asr"

All self-hosted transcription stays on your network. No audio data sent to cloud services.

Self-Hosted Embeddings

Self-hosted replacement for OpenAI's embedding API:

curl http://localhost:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": "self-hosted document embedding for private RAG pipelines"}'

Self-Hosted Cost Comparison

Service	Cloud cost	Self-hosted cost
GPT-4o (1M tokens/month)	~$15-30/month	$0 (self-hosted hardware you own)
DALL-E (1000 images/month)	~$40/month	$0 (self-hosted image gen)
Whisper API (10 hours audio/month)	~$6/month	$0 (self-hosted transcription)
OpenAI embeddings (1M tokens/month)	~$0.10/month	$0 (self-hosted embeddings)
Total	~$60+/month	$0/month self-hosted

After hardware investment, every self-hosted request is free forever. No rate limits, no usage caps, no surprise bills.

Self-Hosted Advantages

Self-hosted data sovereignty — prompts, images, audio, and documents never leave your network
Self-hosted throughput — your hardware, no rate limits
Self-hosted uptime — cloud API outages don't affect your self-hosted fleet
Self-hosted flexibility — switch models instantly, no vendor lock-in
Self-hosted compliance — HIPAA, GDPR, SOC2 — no third-party data processors
Self-hosted predictability — hardware depreciates, but never surprises you with a bill

Self-Hosted Fleet Routing

The self-hosted router scores each device on 7 signals and picks the best one for every request. Multiple self-hosted machines share the load automatically.

# Self-hosted fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# Self-hosted health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

# Self-hosted model recommendations for your hardware
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Self-hosted dashboard at http://localhost:11435/dashboard for visual monitoring of your entire self-hosted fleet.

Full self-hosted documentation

Agent Setup Guide — all 4 self-hosted model types
Image Generation Guide — 3 self-hosted image backends
API Reference
Configuration

Contribute

Ollama Herd is open source (MIT). Self-hosted AI for everyone:

Star on GitHub — help others discover self-hosted AI
Open an issue — share your self-hosted setup
PRs welcome from humans and AI agents. CLAUDE.md gives full self-hosted context. 444 tests.

Self-Hosted Guardrails

No automatic downloads — all self-hosted model pulls require explicit user confirmation.
Self-hosted model deletion requires explicit user confirmation.
All self-hosted requests stay local — no data leaves your network. No telemetry, no analytics, no cloud callbacks.
Never delete or modify self-hosted files in ~/.fleet-manager/.
Your self-hosted fleet has zero cloud dependencies — works fully offline after initial model downloads.

self-hosted-ai

Safety Notice

Copy this and send it to your AI assistant to learn