LangChain Performance Tuning

Overview

Optimize LangChain applications for lower latency, higher throughput, and efficient resource utilization.

Prerequisites

Working LangChain application
Performance baseline measurements
Profiling tools available

Instructions

Step 1: Measure Baseline Performance

import time from functools import wraps from typing import Callable import statistics

def benchmark(func: Callable, iterations: int = 10): """Benchmark a function's performance.""" times = [] for _ in range(iterations): start = time.perf_counter() func() elapsed = time.perf_counter() - start times.append(elapsed)

return {
    "mean": statistics.mean(times),
    "median": statistics.median(times),
    "stdev": statistics.stdev(times) if len(times) > 1 else 0,
    "min": min(times),
    "max": max(times),
}

Usage

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def test_call(): llm.invoke("Hello!")

results = benchmark(test_call, iterations=5) print(f"Mean latency: {results['mean']:.3f}s")

Step 2: Enable Response Caching

from langchain_core.globals import set_llm_cache from langchain_community.cache import InMemoryCache, SQLiteCache, RedisCache

Option 1: In-memory cache (single process)

set_llm_cache(InMemoryCache())

Option 2: SQLite cache (persistent, single node)

set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

Option 3: Redis cache (distributed, production)

import redis redis_client = redis.Redis.from_url("redis://localhost:6379") # 6379: Redis port set_llm_cache(RedisCache(redis_client))

Cache hit = ~0ms latency vs ~500-2000ms for API call

Step 3: Optimize Batch Processing

import asyncio from langchain_openai import ChatOpenAI from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini") prompt = ChatPromptTemplate.from_template("{input}") chain = prompt | llm

Sequential (slow)

def process_sequential(inputs: list) -> list: return [chain.invoke({"input": inp}) for inp in inputs]

Batch (faster - automatic batching)

def process_batch(inputs: list) -> list: batch_inputs = [{"input": inp} for inp in inputs] return chain.batch(batch_inputs, config={"max_concurrency": 10})

Async (fastest - true parallelism)

async def process_async(inputs: list) -> list: batch_inputs = [{"input": inp} for inp in inputs] return await chain.abatch(batch_inputs, config={"max_concurrency": 20})

Benchmark: 10 items

Sequential: ~10s (1s each)

Batch: ~2s (parallel API calls)

Async: ~1.5s (optimal parallelism)

Step 4: Use Streaming for Perceived Performance

from langchain_openai import ChatOpenAI

Non-streaming: User waits for full response

llm = ChatOpenAI(model="gpt-4o-mini") response = llm.invoke("Tell me a story") # Wait 2-3 seconds

Streaming: First token in ~200ms

llm_stream = ChatOpenAI(model="gpt-4o-mini", streaming=True) for chunk in llm_stream.stream("Tell me a story"): print(chunk.content, end="", flush=True)

Step 5: Optimize Prompt Length

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int: """Count tokens in text.""" encoding = tiktoken.encoding_for_model(model) return len(encoding.encode(text))

def optimize_prompt(prompt: str, max_tokens: int = 1000) -> str: # 1 second in ms """Truncate prompt to fit token limit.""" encoding = tiktoken.encoding_for_model("gpt-4o-mini") tokens = encoding.encode(prompt) if len(tokens) <= max_tokens: return prompt return encoding.decode(tokens[:max_tokens])

Example: Long context optimization

system_prompt = "You are a helpful assistant." # ~5 tokens user_context = "Here is the document: " + long_document # Could be 10000+ tokens

Optimize by summarizing or chunking context

Step 6: Connection Pooling

import httpx from langchain_openai import ChatOpenAI

Configure connection pooling for high throughput

transport = httpx.HTTPTransport( retries=3, limits=httpx.Limits( max_connections=100, max_keepalive_connections=20 ) )

Use shared client across requests

http_client = httpx.Client(transport=transport, timeout=30.0)

Note: OpenAI SDK handles this internally, but for custom integrations:

llm = ChatOpenAI( model="gpt-4o-mini", http_client=http_client # Reuse connections )

Step 7: Model Selection Optimization

Match model to task complexity

Fast + Cheap: Simple tasks

llm_fast = ChatOpenAI(model="gpt-4o-mini", temperature=0)

Powerful + Slower: Complex reasoning

llm_powerful = ChatOpenAI(model="gpt-4o", temperature=0)

Router pattern: Choose model based on task

from langchain_core.runnables import RunnableBranch

def classify_complexity(input_dict: dict) -> str: """Classify input complexity.""" text = input_dict.get("input", "") # Simple heuristic - replace with classifier return "complex" if len(text) > 500 else "simple" # HTTP 500 Internal Server Error

router = RunnableBranch( (lambda x: classify_complexity(x) == "simple", prompt | llm_fast), prompt | llm_powerful # Default to powerful )

Performance Metrics

Optimization Latency Improvement Cost Impact

Caching 90-99% on cache hit Major reduction

Batching 50-80% for bulk Neutral

Streaming Perceived 80%+ Neutral

Shorter prompts 10-30% Cost reduction

Connection pooling 5-10% Neutral

Model routing 20-50% Cost reduction

Output

Performance benchmarking setup
Caching implementation
Optimized batch processing
Streaming for perceived performance

Resources

LangChain Caching
OpenAI Latency Guide
tiktoken

Next Steps

Use langchain-cost-tuning to optimize API costs alongside performance.

Error Handling

Error Cause Resolution

Authentication failure Invalid or expired credentials Refresh tokens or re-authenticate with CI/CD

Configuration conflict Incompatible settings detected Review and resolve conflicting parameters

Resource not found Referenced resource missing Verify resource exists and permissions are correct

Examples

Basic usage: Apply langchain performance tuning to a standard project setup with default configuration options.

Advanced scenario: Customize langchain performance tuning for production environments with multiple constraints and team-specific requirements.

langchain-performance-tuning

Safety Notice

Copy this and send it to your AI assistant to learn