llama-llama3

Llama 3 by Meta — run Llama 3.3, Llama 3.2, and Llama 3.1 across your local device fleet. The most popular open-source LLM family, routed to the best available machine: 8B for fast responses, 70B for quality, 405B for frontier performance. OpenAI-compatible API. Cross-platform (macOS, Linux, Windows). Zero cloud costs.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install

npx skills add twinsgeeks/llama-llama3

Llama 3 — Run Meta's LLMs Across Your Local Fleet

Llama is the most widely deployed open-source LLM family. This skill routes Llama requests across your devices — the fleet picks the best machine for each request automatically.

Supported Llama models

Model      | Parameters     | Ollama name   | Best for
Llama 3.3  | 70B            | llama3.3:70b  | Best overall, comparable to GPT-4o on many benchmarks
Llama 3.2  | 1B, 3B         | llama3.2:3b   | Fast responses on low-RAM devices
Llama 3.1  | 8B, 70B, 405B  | llama3.1:70b  | Proven workhorse, massive community
Llama 3    | 8B, 70B        | llama3:70b    | Original release, still widely used

Quick start

pip install ollama-herd    # PyPI: https://pypi.org/project/ollama-herd/
herd                       # start the router (port 11435)
herd-node                  # run on each device — finds the router automatically

No models are downloaded during installation. Models are pulled on demand when a request arrives, or manually via the dashboard. All pulls require user confirmation.
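
A user-initiated pull can also be triggered programmatically. This is a minimal sketch that assumes the router forwards Ollama's standard /api/pull endpoint; that assumption, and the payload shape, should be checked against your router's documentation first:

import json
import urllib.request

# Assumption: the router proxies Ollama's standard /api/pull endpoint.
# Verify this against your router's docs before relying on it.
req = urllib.request.Request(
    "http://localhost:11435/api/pull",
    data=json.dumps({"model": "llama3.2:3b", "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # blocks until the pull finishes
    print(json.load(resp).get("status"))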

Use Llama through the fleet

OpenAI SDK (drop-in replacement)

from openai import OpenAI

# Point the client at the fleet router; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain transformer architecture"}],
    stream=True,  # print tokens as they arrive
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

curl (Ollama format)

curl http://localhost:11435/api/chat -d '{
  "model": "llama3.3:70b",
  "messages": [{"role": "user", "content": "Write a Python quicksort"}],
  "stream": false
}'

curl (OpenAI format)

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'

Which Llama model for your hardware

Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.

Pick the model that fits your available memory — smaller models work great for most tasks:

Model          | Min RAM | Example hardware
llama3.2:1b    | 2GB     | Any Mac — even 8GB
llama3.2:3b    | 4GB     | Mac Mini (16GB)
llama3:8b      | 8GB     | Mac Mini (16GB)
llama3.3:70b   | 48GB    | Mac Studio M4 Max (128GB)
llama3.1:405b  | 256GB+  | Mac Studio M4 Ultra (256GB) or distributed
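
To turn the table into a selection rule, a short script can compare free memory against each model's floor and pick the largest fit. This is an illustrative sketch: the thresholds mirror the table above (approximate), and psutil is an extra dependency, not part of this skill:

import psutil  # pip install psutil

# Minimum RAM per model, mirroring the table above (approximate).
MODELS = [
    ("llama3.1:405b", 256),
    ("llama3.3:70b", 48),
    ("llama3:8b", 8),
    ("llama3.2:3b", 4),
    ("llama3.2:1b", 2),
]

available_gb = psutil.virtual_memory().available / 1024**3
model = next((name for name, min_gb in MODELS if available_gb >= min_gb), None)
print(f"{available_gb:.1f} GB free -> {model or 'no Llama model fits'}")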

The fleet router sends requests to the machine where the model is loaded. No manual routing needed.

Why run Llama locally

  • Free after hardware — Meta's license allows commercial use with no per-token cost
  • Privacy — prompts and responses never leave your network
  • No rate limits — your hardware, your throughput
  • Fleet routing — multiple machines share the load automatically

See what's running

# Models loaded in memory right now
curl -s http://localhost:11435/api/ps | python3 -m json.tool

# All models available across the fleet
curl -s http://localhost:11435/api/tags | python3 -m json.tool

Monitor Llama performance

# Recent request traces — see latency, tokens, which node handled each request
curl -s "http://localhost:11435/dashboard/api/traces?limit=10" | python3 -m json.tool

# Fleet health — 15 automated checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
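
For a quick per-node view, the traces endpoint can be post-processed in Python. The field names used below ("node", "latency_ms") and the payload shape are guesses; inspect the actual JSON with the curl command above and adjust:

import json
import urllib.request

# Illustrative only: summarize recent traces per node. Field names and
# payload shape are assumptions; check the real response first.
with urllib.request.urlopen(
    "http://localhost:11435/dashboard/api/traces?limit=50"
) as resp:
    payload = json.load(resp)

traces = payload.get("traces", []) if isinstance(payload, dict) else payload

by_node = {}
for t in traces:
    by_node.setdefault(t.get("node", "unknown"), []).append(t.get("latency_ms", 0))

for node, latencies in sorted(by_node.items()):
    avg = sum(latencies) / len(latencies)
    print(f"{node}: {len(latencies)} requests, avg {avg:.0f} ms")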

Web dashboard at http://localhost:11435/dashboard — live view of all nodes, queues, and models.

Also available on this fleet

Other LLM models

Qwen 3.5, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3, Codestral — any Ollama model routes through the same endpoint.
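
Because the endpoint is shared, switching models is just a different model string in the same client. For example (deepseek-r1:7b is a real Ollama tag, but use whatever your fleet has actually pulled):

from openai import OpenAI

# Same router, different model: any Ollama tag routes through the fleet.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="deepseek-r1:7b",  # swap in any pulled model tag
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)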

Image generation

curl http://localhost:11435/api/generate-image \
  -d '{"model": "z-image-turbo", "prompt": "a llama in the mountains", "width": 512, "height": 512}'

Speech-to-text

curl http://localhost:11435/api/transcribe -F "file=@recording.wav" -F "model=qwen3-asr"

Embeddings

curl http://localhost:11435/api/embed \
  -d '{"model": "nomic-embed-text", "input": "Meta Llama open source language model"}'

Guardrails

  • Model downloads require explicit user confirmation — Llama models range from 1GB (1B) to 230GB+ (405B). Always confirm before pulling.
  • Model deletion requires explicit user confirmation.
  • Never delete or modify files in ~/.fleet-manager/.
  • If a model is too large for available memory, suggest a smaller variant.
  • No models are downloaded automatically — all pulls are user-initiated or require opt-in via the auto_pull setting.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • Local Llm Router (Coding): Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on macOS, Linux, a...
  • Gemma Gemma3 (Coding): Gemma 3 by Google — run Gemma 3 (4B, 12B, 27B) across your local device fleet. Google's most capable open model with 128K context, strong coding, and multili...
  • Ollama Herd (General): Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral — plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama...
  • Phi Phi4 (Coding): Phi 4 by Microsoft — small but powerful LLMs that run on minimal hardware. Phi-4 (14B), Phi-4-mini (3.8B), and Phi-3.5 across your device fleet. Perfect for...