# Ollama Local Inference
Run LLMs locally for cost savings, privacy, and offline development.
## Quick Start

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull models
ollama pull deepseek-r1:70b      # Reasoning (GPT-4 level)
ollama pull qwen2.5-coder:32b    # Coding
ollama pull nomic-embed-text     # Embeddings

# Start the server
ollama serve
```
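Once the server is up, one request against the local REST API confirms the install end to end. A minimal sketch using `requests` (assumes `deepseek-r1:70b` is already pulled; the first call also loads the model, so it may take a while):

```python
import requests

# Non-streaming generation request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:70b",
        "prompt": "Say hello in one word.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```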
## Recommended Models (M4 Max 256GB)

| Task       | Model               | Size    | Notes                 |
|------------|---------------------|---------|-----------------------|
| Reasoning  | `deepseek-r1:70b`   | ~42 GB  | GPT-4 level           |
| Coding     | `qwen2.5-coder:32b` | ~35 GB  | 73.7% Aider benchmark |
| Embeddings | `nomic-embed-text`  | ~0.5 GB | 768 dims, fast        |
| General    | `llama3.3:70b`      | ~40 GB  | Good all-around       |
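Before pointing code at these models, it is worth checking that they are actually pulled on the machine. A hedged sketch against the `/api/tags` endpoint (the helper name `assert_models_present` is illustrative, not part of Ollama):

```python
import requests

def assert_models_present(required: list[str], host: str = "http://localhost:11434") -> None:
    """Fail fast if any required model has not been pulled yet."""
    tags = requests.get(f"{host}/api/tags", timeout=10).json()
    installed = {m["name"] for m in tags.get("models", [])}
    missing = [
        r for r in required
        if not any(name == r or name == f"{r}:latest" for name in installed)
    ]
    if missing:
        raise RuntimeError(f"Missing Ollama models (run `ollama pull`): {missing}")

assert_models_present(["deepseek-r1:70b", "qwen2.5-coder:32b", "nomic-embed-text"])
```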
## LangChain Integration

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,     # Context window
    keep_alive="5m",   # Keep model loaded between calls
)

# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate
response = await llm.ainvoke("Explain async/await")
vector = await embeddings.aembed_query("search text")
```
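For retrieval workloads, the same `embeddings` object can embed a batch of documents and rank them against a query. A sketch of a simple cosine-similarity lookup (the corpus and ranking logic are illustrative, not part of LangChain):

```python
import numpy as np

docs = [
    "Async functions let you await I/O without blocking.",
    "Ollama serves local models over an HTTP API.",
    "Pydantic validates structured data with type hints.",
]

# Embed the corpus and the query with nomic-embed-text
doc_vectors = np.array(await embeddings.aembed_documents(docs))
query_vector = np.array(await embeddings.aembed_query("How does async I/O work?"))

# Cosine similarity, then pick the most relevant document
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(docs[int(scores.argmax())])
```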
## Tool Calling with Ollama

```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the document database."""
    return f"Found results for: {query}"

# Bind tools
llm_with_tools = llm.bind_tools([search_docs])
response = await llm_with_tools.ainvoke("Search for Python patterns")
```
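The response only proposes tool calls; your code still has to execute them and return the results to the model. A minimal loop on top of the objects above (the single-tool dispatch is a simplification):

```python
from langchain_core.messages import HumanMessage, ToolMessage

messages = [HumanMessage("Search for Python patterns")]
response = await llm_with_tools.ainvoke(messages)
messages.append(response)

# Run each requested tool and feed its output back as a ToolMessage
for call in response.tool_calls:
    if call["name"] == "search_docs":
        result = search_docs.invoke(call["args"])
        messages.append(ToolMessage(content=result, tool_call_id=call["id"]))

final = await llm_with_tools.ainvoke(messages)
print(final.content)
```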
## Structured Output

```python
from pydantic import BaseModel, Field

class CodeAnalysis(BaseModel):
    language: str = Field(description="Programming language")
    complexity: int = Field(ge=1, le=10)
    issues: list[str] = Field(description="Found issues")

structured_llm = llm.with_structured_output(CodeAnalysis)
result = await structured_llm.ainvoke("Analyze this code: ...")
# result is a typed CodeAnalysis object
```
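Because `result` is a validated Pydantic object, downstream code can branch on typed fields directly instead of parsing JSON; a short usage sketch (the complexity threshold is arbitrary):

```python
# `result` is the CodeAnalysis instance returned above
print(f"{result.language}: complexity {result.complexity}/10")
if result.complexity >= 7 or result.issues:
    print("Needs review:", "; ".join(result.issues))
```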
## Provider Factory Pattern

```python
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def get_llm_provider(task_type: str = "general"):
    """Auto-switch between Ollama and cloud APIs."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        models = {
            "reasoning": "deepseek-r1:70b",
            "coding": "qwen2.5-coder:32b",
            "general": "llama3.3:70b",
        }
        return ChatOllama(
            model=models.get(task_type, "llama3.3:70b"),
            keep_alive="5m",
        )
    else:
        # Fall back to cloud API
        return ChatOpenAI(model="gpt-5.2")

# Usage
llm = get_llm_provider(task_type="coding")
```
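The same switch works for embeddings, so retrieval code never has to know which backend is active. A sketch that mirrors `get_llm_provider` (the cloud fallback model `text-embedding-3-small` is an assumption, not something this guide prescribes):

```python
import os

from langchain_ollama import OllamaEmbeddings
from langchain_openai import OpenAIEmbeddings

def get_embeddings_provider():
    """Mirror get_llm_provider() for embedding models."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        return OllamaEmbeddings(
            model=os.getenv("OLLAMA_MODEL_EMBED", "nomic-embed-text"),
        )
    # Fall back to a cloud embedding model
    return OpenAIEmbeddings(model="text-embedding-3-small")

embeddings = get_embeddings_provider()
```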
## Environment Configuration

```bash
# .env.local
OLLAMA_ENABLED=true
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL_REASONING=deepseek-r1:70b
OLLAMA_MODEL_CODING=qwen2.5-coder:32b
OLLAMA_MODEL_EMBED=nomic-embed-text

# Performance tuning (Apple Silicon)
OLLAMA_MAX_LOADED_MODELS=3   # Keep 3 models in memory
OLLAMA_KEEP_ALIVE=5m         # 5 minute keep-alive
```
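If the Python side also reads these variables (as the provider factory above does), they need to be loaded into the process first; a sketch using `python-dotenv` (assumes the package is installed and `.env.local` sits in the project root):

```python
import os

from dotenv import load_dotenv

# Load .env.local so the OLLAMA_* variables are visible to the factory
load_dotenv(".env.local")

print(os.getenv("OLLAMA_ENABLED"))          # "true"
print(os.getenv("OLLAMA_MODEL_REASONING"))  # "deepseek-r1:70b"
```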
## CI Integration

```yaml
# GitHub Actions (self-hosted runner)
jobs:
  test:
    runs-on: self-hosted  # M4 Max runner
    env:
      OLLAMA_ENABLED: "true"
    steps:
      - name: Pre-warm models
        run: |
          curl -s http://localhost:11434/api/embeddings \
            -d '{"model":"nomic-embed-text","prompt":"warmup"}' > /dev/null
      - name: Run tests
        run: pytest tests/
```
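On runners where Ollama is not available, the same test suite can skip local-model tests instead of failing. A hedged pytest sketch (the fixture name `local_llm` is illustrative):

```python
import os

import pytest
import requests

@pytest.fixture(scope="session")
def local_llm():
    """Provide an Ollama-backed LLM, or skip when the server is unavailable."""
    if os.getenv("OLLAMA_ENABLED") != "true":
        pytest.skip("OLLAMA_ENABLED is not set")
    try:
        requests.get("http://localhost:11434/api/tags", timeout=2)
    except requests.ConnectionError:
        pytest.skip("Ollama server is not running")
    from langchain_ollama import ChatOllama
    return ChatOllama(model="qwen2.5-coder:32b", keep_alive="5m")
```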
## Cost Comparison

| Provider     | Monthly Cost       | Latency     |
|--------------|--------------------|-------------|
| Cloud APIs   | ~$675/month        | 200-500 ms  |
| Ollama Local | ~$50 (electricity) | 50-200 ms   |
| Savings      | ~93%               | 2-3x faster |
## Best Practices

- DO use `keep_alive="5m"` in CI (avoids cold starts)
- DO pre-warm models before the first call (see the sketch after this list)
- DO set `num_ctx=32768` on Apple Silicon
- DO use the provider factory for cloud/local switching
- DON'T use `keep_alive=-1` (wastes memory)
- DON'T skip pre-warming in CI (30-60 s cold start)
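Pre-warming just means issuing one cheap request per model at startup so the weights are already resident when real traffic arrives. A minimal sketch (the helper name and model list are examples; embedding models can be warmed through `/api/embeddings` as in the CI step above):

```python
import requests

def prewarm(models: list[str], host: str = "http://localhost:11434") -> None:
    """Send a tiny generation request per model so weights load before real use."""
    for model in models:
        requests.post(
            f"{host}/api/generate",
            json={
                "model": model,
                "prompt": "warmup",
                "stream": False,
                "options": {"num_predict": 1},  # generate a single token
            },
            timeout=300,
        )

prewarm(["deepseek-r1:70b", "qwen2.5-coder:32b"])
```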
## Troubleshooting

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# List installed models
ollama list

# Check memory usage of currently loaded models
ollama ps

# Pull a specific quantized version
ollama pull deepseek-r1:70b-q4_K_M
```
## Related Skills

- `embeddings`: Embedding patterns (works with `nomic-embed-text`)
- `llm-evaluation`: Testing with local models
- `cost-optimization`: Broader cost strategies
## Capability Details

### setup

Keywords: setup, install, configure, ollama

Solves:
- Set up Ollama locally
- Configure for development
- Install models

### model-selection

Keywords: model, llama, mistral, qwen, selection

Solves:
- Choose an appropriate model
- Compare model capabilities
- Balance speed vs. quality

### provider-template

Keywords: provider, template, python, implementation

Solves:
- Ollama provider template
- Python implementation
- Drop-in LLM provider