# LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.
## Why Use a Gateway?

Without a gateway:

- Vendor lock-in (one provider)
- No fallbacks (provider down = app down)
- Hard to A/B test models
- Scattered API keys and configs

With a gateway:

- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking
## Quick Decision

| Need | Solution |
|------|----------|
| Fastest setup, multi-model | OpenRouter |
| Full control, self-hosted | LiteLLM |
| Observability + routing | Helicone |
| Enterprise, guardrails | Portkey |
## OpenRouter (Recommended)

### Why OpenRouter

- **400+ models**: OpenAI, Anthropic, Google, Meta, Mistral, and more
- **Single API**: One key for all providers
- **Automatic fallbacks**: Built-in reliability
- **A/B testing**: Easy model comparison
- **Cost tracking**: Unified billing dashboard
- **Free credits**: $1 free to start
### Setup

1. Sign up at openrouter.ai
2. Get an API key from the dashboard
3. Add it to `.env`:

```bash
OPENROUTER_API_KEY=sk-or-v1-...
```
### Basic Usage

```typescript
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```
### With Vercel AI SDK (Recommended)

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
```
### Model IDs

```typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};
```
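Since every ID follows the `provider/model-name` pattern, a small helper can recover the provider, e.g. for per-provider logging or rate limits. A sketch; `parseModelId` is illustrative, not part of any SDK:

```typescript
// Split an OpenRouter model ID into its provider and model parts.
// IDs always take the form "provider/model-name".
function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf("/");
  if (slash === -1) throw new Error(`Invalid model ID: ${id}`);
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}

parseModelId("meta-llama/llama-3.1-70b-instruct");
// → { provider: "meta-llama", model: "llama-3.1-70b-instruct" }
```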
### Fallback Chains

```typescript
// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet", // Primary
  "openai/gpt-4o",               // Fallback 1
  "google/gemini-pro-1.5",       // Fallback 2
];

async function callWithFallback(messages: Message[]) {
  for (const model of modelChain) {
    try {
      return await openrouter.chat({ model, messages });
    } catch (error) {
      console.log(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
```
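Transient errors (429s, timeouts) are often worth retrying on the same model before falling through the chain. A hedged sketch of capped, jittered exponential backoff; `backoffMs` and `callWithRetry` are illustrative names, not SDK functions:

```typescript
// Exponential backoff with "equal jitter": the delay doubles per attempt,
// is capped, and the upper half is randomized to avoid thundering herds.
function backoffMs(attempt: number, baseMs = 250, capMs = 8000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2);
}

// Retry a single model a few times before giving up (and letting the
// fallback chain move to the next model).
async function callWithRetry<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
}
```

In the chain above, each `openrouter.chat` call could be wrapped as `callWithRetry(() => openrouter.chat({ model, messages }))`.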
Cost Routing
// Route based on query complexity function selectModel(query: string): string { const complexity = analyzeComplexity(query);
if (complexity === "simple") { // Simple queries → cheap model return "openai/gpt-4o-mini"; // ~$0.15/1M tokens } else if (complexity === "medium") { // Medium → balanced return "google/gemini-flash-1.5"; // ~$0.075/1M tokens } else { // Complex → best quality return "anthropic/claude-3-5-sonnet"; // ~$3/1M tokens } }
function analyzeComplexity(query: string): "simple" | "medium" | "complex" { // Simple heuristics if (query.length < 50) return "simple"; if (query.includes("explain") || query.includes("analyze")) return "complex"; return "medium"; }
### A/B Testing

```typescript
// Random assignment
function getModel(userId: string): string {
  const hash = userId.charCodeAt(0) % 100;

  if (hash < 50) {
    return "anthropic/claude-3-5-sonnet"; // 50%
  } else {
    return "openai/gpt-4o"; // 50%
  }
}

// Track which model was used
const model = getModel(userId);
const response = await openrouter.chat({ model, messages });
await analytics.track("llm_call", { model, userId, latency, cost });
```
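Hashing only the first character of the user ID skews the split when IDs share a common prefix. For stable, roughly uniform assignment, hash the whole ID; a minimal FNV-1a sketch (`bucket` is an illustrative helper, not a library function):

```typescript
// FNV-1a 32-bit hash: deterministic and well-distributed, so each user
// lands in the same bucket on every request.
function bucket(userId: string, buckets = 100): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % buckets;
}

const assignedModel = bucket("user-42") < 50
  ? "anthropic/claude-3-5-sonnet" // 50% arm
  : "openai/gpt-4o";              // 50% arm
```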
## LiteLLM (Self-Hosted)

### Why LiteLLM

- **Self-hosted**: Full control over data
- **100+ providers**: Similar coverage to OpenRouter
- **Load balancing**: Distribute across providers
- **Cost tracking**: Built-in spend management
- **Caching**: Redis or in-memory
- **Rate limiting**: Per-user limits
### Setup

```bash
# Install
pip install 'litellm[proxy]'

# Run the proxy
litellm --config config.yaml

# Use as an OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
```
### Configuration

```yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced (multiple providers): repeat the same model_name
  # and requests are distributed across both entries
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...
```
Routing
router_settings: routing_strategy: simple-shuffle # or latency-based-routing num_retries: 3 timeout: 30
Rate limiting
litellm_settings: max_budget: 100 # $100/month budget_duration: monthly
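The same idea can also be enforced app-side as a guard before each call. A minimal sketch, assuming costs are known per call; the `Budget` class is illustrative, and LiteLLM's `max_budget` already does this server-side:

```typescript
// Tracks cumulative spend and refuses calls once the budget is exhausted.
class Budget {
  private spent = 0;
  constructor(private limitUsd: number) {}

  // Record the actual cost of a completed call.
  record(costUsd: number) {
    this.spent += costUsd;
  }

  // Check whether an estimated call still fits within the budget.
  allow(estimatedCostUsd: number): boolean {
    return this.spent + estimatedCostUsd <= this.limitUsd;
  }
}

const budget = new Budget(100); // $100/month, mirroring max_budget above
budget.record(99.5);
budget.allow(0.4); // → true  (99.9 <= 100)
budget.allow(1.0); // → false (100.5 > 100)
```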
### Fallbacks in LiteLLM

```yaml
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
    fallbacks:
      - model_name: fallback-1
        litellm_params:
          model: openai/gpt-4o
      - model_name: fallback-2
        litellm_params:
          model: google/gemini-pro
```
### Usage

```typescript
// Use like the OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet", // Maps to the configured model
  messages: [{ role: "user", content: "Hello!" }],
});
```
Routing Strategies
- Cost-Based Routing
const costTiers = { cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"], balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"], premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"], };
function routeByCost(budget: "cheap" | "balanced" | "premium"): string { const models = costTiers[budget]; return models[Math.floor(Math.random() * models.length)]; }
### Latency-Based Routing

```typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep the last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
```
### Task-Based Routing

```typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",   // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
```
### Hybrid Routing

```typescript
interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

// Assumes a `models` catalog and a `getTaskScore` helper defined elsewhere
function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Select the best model for the task
  const taskScores = fast.map(m => ({
    model: m.id,
    score: getTaskScore(m.id, config.task),
  }));

  return taskScores.sort((a, b) => b.score - a.score)[0].model;
}
```
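The hybrid routing snippet relies on a `models` catalog and a `getTaskScore` helper. A self-contained sketch with a small hypothetical catalog; the cost and latency numbers are illustrative, not real pricing or benchmark data:

```typescript
interface ModelInfo {
  id: string;
  cost: number;        // $ per 1M input tokens (illustrative)
  avgLatency: number;  // observed latency in ms (illustrative)
  strengths: string[]; // tasks this model is strong at
}

const catalog: ModelInfo[] = [
  { id: "openai/gpt-4o-mini", cost: 0.15, avgLatency: 400, strengths: ["simple"] },
  { id: "anthropic/claude-3-5-sonnet", cost: 3, avgLatency: 900, strengths: ["coding", "creative"] },
  { id: "openai/o1-preview", cost: 15, avgLatency: 5000, strengths: ["reasoning"] },
];

function hybridRoute(task: string, maxCost: number, maxLatency: number): string {
  const candidates = catalog
    // Filter by the cost and latency budgets
    .filter((m) => m.cost <= maxCost && m.avgLatency <= maxLatency)
    // Prefer models strong at the task, then the cheapest
    .sort((a, b) =>
      Number(b.strengths.includes(task)) - Number(a.strengths.includes(task)) ||
      a.cost - b.cost);
  if (candidates.length === 0) throw new Error("No model satisfies constraints");
  return candidates[0].id;
}

hybridRoute("coding", 5, 2000); // → "anthropic/claude-3-5-sonnet"
```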
## Best Practices

### Always Have Fallbacks

```typescript
// Bad: single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: fallback chain
const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
for (const model of models) {
  try {
    return await gateway.chat({ model, messages });
  } catch (e) {
    continue;
  }
}
```
### Pin Model Versions

```typescript
// Bad: model can change underneath you
const model = "gpt-4";

// Good: pinned version
const model = "openai/gpt-4-0125-preview";
```
### Track Costs

```typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
```
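The `calculateCost` helper above is assumed. A minimal sketch with a hypothetical price table; the figures are illustrative and drift over time, so verify against each provider's pricing page:

```typescript
// $ per 1M tokens; illustrative figures, not authoritative pricing.
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const p = pricing[model];
  if (!p) return 0; // unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens / 1_000_000) * p.input +
    (usage.completion_tokens / 1_000_000) * p.output
  );
}
```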
### Set Token Limits

```typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500, // Limit output length
});
```
### Use Caching

```yaml
# LiteLLM caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600 # 1 hour
```
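The same effect can be sketched app-side with a TTL map keyed on model + messages. An illustrative in-memory version; LiteLLM's Redis cache is the production path, and `TtlCache` is not a library class:

```typescript
// Simple TTL cache: identical (model, messages) pairs within the TTL
// return the stored response instead of hitting the gateway again.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (Date.now() > hit.expires) {
      this.store.delete(key); // lazily evict stale entries
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V) {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}

const cache = new TtlCache<string>(3600_000); // 1 hour, mirroring ttl: 3600
const key = JSON.stringify({
  model: "claude-sonnet",
  messages: [{ role: "user", content: "Hello!" }],
});
cache.set(key, "cached response");
cache.get(key); // → "cached response"
```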
## References

- `references/openrouter-guide.md` - OpenRouter deep dive
- `references/litellm-guide.md` - LiteLLM self-hosting
- `references/routing-strategies.md` - Advanced routing patterns
- `references/alternatives.md` - Helicone, Portkey, etc.
## Templates

- `templates/openrouter-config.ts` - TypeScript OpenRouter setup
- `templates/litellm-config.yaml` - LiteLLM proxy config
- `templates/fallback-chain.ts` - Fallback implementation