LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.

Why Use a Gateway?

Without gateway:

  • Vendor lock-in (one provider)

  • No fallbacks (provider down = app down)

  • Hard to A/B test models

  • Scattered API keys and configs

With gateway:

  • Single API for 400+ models

  • Automatic fallbacks

  • Easy model switching

  • Unified cost tracking

Quick Decision

  • Fastest setup, multi-model → OpenRouter

  • Full control, self-hosted → LiteLLM

  • Observability + routing → Helicone

  • Enterprise, guardrails → Portkey

OpenRouter (Recommended)

Why OpenRouter

  • 400+ models: OpenAI, Anthropic, Google, Meta, Mistral, and more

  • Single API: One key for all providers

  • Automatic fallbacks: Built-in reliability

  • A/B testing: Easy model comparison

  • Cost tracking: Unified billing dashboard

  • Free credits: $1 in free credits to start

Setup

1. Sign up at openrouter.ai

2. Get API key from dashboard

3. Add to .env:

OPENROUTER_API_KEY=sk-or-v1-...
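
Before wiring the key into an app, a quick request against OpenRouter's model listing endpoint confirms it is accepted. A minimal sketch, assuming the OpenAI-compatible /models route and response shape that OpenRouter documents:

// Quick sanity check that the OpenRouter key is accepted
const res = await fetch("https://openrouter.ai/api/v1/models", {
  headers: { Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}` },
});

if (!res.ok) throw new Error(`OpenRouter key check failed: ${res.status}`);
const { data } = await res.json();
console.log(`Key OK, ${data.length} models available`);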

Basic Usage

// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});

With Vercel AI SDK (Recommended)

import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
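
For chat UIs you usually want tokens as they arrive rather than one completed string. The same provider instance works with the SDK's streaming helper; a sketch reusing the openrouter instance from above:

import { streamText } from "ai";

const result = await streamText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk); // render tokens as they arrive
}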

Model IDs

// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};

Fallback Chains

// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet", // Primary
  "openai/gpt-4o",               // Fallback 1
  "google/gemini-pro-1.5",       // Fallback 2
];

async function callWithFallback(messages: Message[]) {
  for (const model of modelChain) {
    try {
      return await openrouter.chat({ model, messages });
    } catch (error) {
      console.log(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
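
Transient failures (rate limits, timeouts) are often worth retrying on the same model before moving down the chain. A minimal sketch, assuming a generic callModel helper (hypothetical) and treating every error as retryable for brevity:

async function callWithRetry(model: string, messages: Message[], maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callModel(model, messages); // hypothetical gateway call
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error;
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** attempt));
    }
  }
}

Combining both patterns (retry within a model, then fall through to the next) covers most provider incidents without hammering a provider that is already failing.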

Cost Routing

// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini"; // ~$0.15/1M tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5"; // ~$0.075/1M tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet"; // ~$3/1M tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}
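
Wired into the Vercel AI SDK setup from earlier, the router becomes a one-line substitution for a hard-coded model id. A sketch reusing the openrouter provider and selectModel above:

async function answer(query: string) {
  const model = selectModel(query);
  const { text } = await generateText({
    model: openrouter(model),
    prompt: query,
  });
  return { model, text }; // return the model so cost can be attributed later
}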

A/B Testing

// Random assignment
function getModel(userId: string): string {
  const hash = userId.charCodeAt(0) % 100;

  if (hash < 50) {
    return "anthropic/claude-3-5-sonnet"; // 50%
  } else {
    return "openai/gpt-4o"; // 50%
  }
}

// Track which model was used
const model = getModel(userId);
const response = await openrouter.chat({ model, messages });
await analytics.track("llm_call", { model, userId, latency, cost });
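
Hashing only the first character skews the split whenever user IDs share a common prefix (UUID version digits, user_ prefixes, and so on). A sketch of a sturdier variant that folds the whole ID into the bucket; the constants are the standard 32-bit FNV-1a parameters and the 50/50 split matches the example above:

// Deterministic bucket in [0, 100) derived from the full userId (FNV-1a hash)
function bucket(userId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % 100;
}

function getModelStable(userId: string): string {
  return bucket(userId) < 50
    ? "anthropic/claude-3-5-sonnet"
    : "openai/gpt-4o";
}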

LiteLLM (Self-Hosted)

Why LiteLLM

  • Self-hosted: Full control over data

  • 100+ providers: Same coverage as OpenRouter

  • Load balancing: Distribute across providers

  • Cost tracking: Built-in spend management

  • Caching: Redis or in-memory

  • Rate limiting: Per-user limits

Setup

Install

pip install litellm[proxy]

Run proxy

litellm --config config.yaml

Use as OpenAI-compatible endpoint

export OPENAI_API_BASE=http://localhost:4000

Configuration

config.yaml:

model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: two entries share one model_name,
  # so requests are distributed across both providers
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle  # or latency-based-routing
  num_retries: 3
  timeout: 30

# Rate limiting
litellm_settings:
  max_budget: 100  # $100/month
  budget_duration: monthly

Fallbacks in LiteLLM

model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: fallback-1
    litellm_params:
      model: openai/gpt-4o
  - model_name: fallback-2
    litellm_params:
      model: google/gemini-pro

# If "primary" fails, retry the request on the fallbacks in order
litellm_settings:
  fallbacks:
    - primary: ["fallback-1", "fallback-2"]

Usage

// Use like OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet", // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});

Routing Strategies

  1. Cost-Based Routing

const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}

  2. Latency-Based Routing

// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
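
The two pieces are easiest to use behind a single wrapper that times each request and feeds the sample back into the stats. A sketch, assuming the same hypothetical callModel gateway helper as earlier and that every candidate model already has at least one recorded sample:

async function timedCall(messages: Message[]) {
  const model = routeByLatency();
  const start = Date.now();
  try {
    return await callModel(model, messages); // hypothetical gateway call
  } finally {
    recordLatency(model, Date.now() - start); // slow failures count against the model too
  }
}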

  3. Task-Based Routing

const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",   // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}

  4. Hybrid Routing

interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Select best for task
  const taskScores = fast.map(m => ({
    model: m.id,
    score: getTaskScore(m.id, config.task),
  }));

  return taskScores.sort((a, b) => b.score - a.score)[0].model;
}
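
The filter steps assume a catalog with per-model cost and latency metadata plus a getTaskScore lookup, none of which the snippet defines. A sketch of the assumed shape; the prices echo the per-1M-token figures quoted earlier, while the latencies and scores are illustrative placeholders:

interface ModelInfo {
  id: string;
  cost: number;       // USD per 1M input tokens (illustrative)
  avgLatency: number; // rolling average in ms (illustrative)
}

const models: ModelInfo[] = [
  { id: "openai/gpt-4o-mini", cost: 0.15, avgLatency: 800 },
  { id: "google/gemini-flash-1.5", cost: 0.075, avgLatency: 700 },
  { id: "anthropic/claude-3-5-sonnet", cost: 3.0, avgLatency: 1500 },
];

// Hypothetical fit score per (model, task), e.g. from an offline eval
function getTaskScore(modelId: string, task: string): number {
  const scores: Record<string, Record<string, number>> = {
    coding: { "anthropic/claude-3-5-sonnet": 0.9, "openai/gpt-4o-mini": 0.6 },
  };
  return scores[task]?.[modelId] ?? 0.5;
}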

Best Practices

  1. Always Have Fallbacks

// Bad: Single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: Fallback chain
const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
for (const model of models) {
  try {
    return await gateway.chat({ model, messages });
  } catch (e) {
    continue;
  }
}

  2. Pin Model Versions

// Bad: Model can change
const model = "gpt-4";

// Good: Pinned version
const model = "openai/gpt-4-0125-preview";

  3. Track Costs

// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
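
calculateCost is referenced above but never defined. A minimal sketch backed by a hand-maintained price table; the per-1M-token prices are illustrative placeholders, so check each provider's current pricing before relying on the numbers:

// USD per 1M tokens (illustrative placeholders)
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3.0, output: 15.0 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const price = pricing[model];
  if (!price) return 0; // unknown model: log zero rather than guessing
  return (
    (usage.prompt_tokens / 1_000_000) * price.input +
    (usage.completion_tokens / 1_000_000) * price.output
  );
}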

  4. Set Token Limits

// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500, // Limit output length
});

  5. Use Caching

# LiteLLM caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600 # 1 hour
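
Proxy-level caching only helps traffic that goes through LiteLLM. For direct OpenRouter calls, a small application-side cache keyed on model and prompt gives the same effect for exact-repeat queries. A minimal in-memory sketch reusing the generateText setup from earlier; a production version would typically use Redis and a TTL:

const responseCache = new Map<string, string>();

async function cachedGenerate(model: string, prompt: string): Promise<string> {
  const key = `${model}:${prompt}`;
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit;

  const { text } = await generateText({ model: openrouter(model), prompt });
  responseCache.set(key, text);
  return text;
}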

References

  • references/openrouter-guide.md - OpenRouter deep dive

  • references/litellm-guide.md - LiteLLM self-hosting

  • references/routing-strategies.md - Advanced routing patterns

  • references/alternatives.md - Helicone, Portkey, etc.

Templates

  • templates/openrouter-config.ts - TypeScript OpenRouter setup

  • templates/litellm-config.yaml - LiteLLM proxy config

  • templates/fallback-chain.ts - Fallback implementation
