langchain-performance-tuning

LangChain Performance Tuning

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to learn this skill

Install skill "langchain-performance-tuning" with this command: npx skills add jeremylongshore/claude-code-plugins-plus-skills/jeremylongshore-claude-code-plugins-plus-skills-langchain-performance-tuning

LangChain Performance Tuning

Overview

Optimize LangChain applications for lower latency, higher throughput, and efficient resource utilization.

Prerequisites

  • Working LangChain application

  • Performance baseline measurements

  • Profiling tools available

Instructions

Step 1: Measure Baseline Performance

import time
import statistics
from typing import Callable

def benchmark(func: Callable, iterations: int = 10):
    """Benchmark a function's performance."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0,
        "min": min(times),
        "max": max(times),
    }

Usage

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def test_call():
    llm.invoke("Hello!")

results = benchmark(test_call, iterations=5)
print(f"Mean latency: {results['mean']:.3f}s")

Step 2: Enable Response Caching

from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache, RedisCache

Option 1: In-memory cache (single process)

set_llm_cache(InMemoryCache())

Option 2: SQLite cache (persistent, single node)

set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

Option 3: Redis cache (distributed, production)

import redis

redis_client = redis.Redis.from_url("redis://localhost:6379")  # default Redis port
set_llm_cache(RedisCache(redis_client))

A cache hit returns in roughly 0 ms, versus ~500-2000 ms for an uncached API call.
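
To check the claim locally, here is a minimal sketch (reusing the in-memory cache from Option 1; the prompt is illustrative) that times the same call twice so the second run hits the cache:

import time
from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
from langchain_openai import ChatOpenAI

set_llm_cache(InMemoryCache())
llm = ChatOpenAI(model="gpt-4o-mini")

for label in ("cold", "cached"):
    start = time.perf_counter()
    llm.invoke("What is the capital of France?")  # identical prompt and params, so the second call is a cache hit
    print(f"{label}: {time.perf_counter() - start:.3f}s")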

Step 3: Optimize Batch Processing

import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("{input}")
chain = prompt | llm

Sequential (slow)

def process_sequential(inputs: list) -> list:
    return [chain.invoke({"input": inp}) for inp in inputs]

Batch (faster - automatic batching)

def process_batch(inputs: list) -> list:
    batch_inputs = [{"input": inp} for inp in inputs]
    return chain.batch(batch_inputs, config={"max_concurrency": 10})

Async (fastest - true parallelism)

async def process_async(inputs: list) -> list:
    batch_inputs = [{"input": inp} for inp in inputs]
    return await chain.abatch(batch_inputs, config={"max_concurrency": 20})

Benchmark: 10 items

Sequential: ~10s (1s each)

Batch: ~2s (parallel API calls)

Async: ~1.5s (optimal parallelism)
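
The async variant needs an event loop to drive it; here is a minimal sketch of how process_async defined above might be invoked and timed (the ten inputs are illustrative):

import asyncio
import time

inputs = [f"Summarize fact #{i} about the ocean." for i in range(10)]  # illustrative workload

start = time.perf_counter()
results = asyncio.run(process_async(inputs))
print(f"Processed {len(results)} inputs in {time.perf_counter() - start:.2f}s")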

Step 4: Use Streaming for Perceived Performance

from langchain_openai import ChatOpenAI

Non-streaming: User waits for full response

llm = ChatOpenAI(model="gpt-4o-mini") response = llm.invoke("Tell me a story") # Wait 2-3 seconds

Streaming: First token in ~200ms

llm_stream = ChatOpenAI(model="gpt-4o-mini", streaming=True)
for chunk in llm_stream.stream("Tell me a story"):
    print(chunk.content, end="", flush=True)
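
To verify the time-to-first-token figure on your own setup, a rough sketch that times the first chunk separately from the full stream (reusing llm_stream from above):

import time

start = time.perf_counter()
first_token_at = None

for chunk in llm_stream.stream("Tell me a story"):
    if first_token_at is None and chunk.content:
        first_token_at = time.perf_counter() - start  # latency until the first non-empty chunk
total = time.perf_counter() - start

print(f"First token: {first_token_at:.2f}s, full response: {total:.2f}s")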

Step 5: Optimize Prompt Length

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens in text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def optimize_prompt(prompt: str, max_tokens: int = 1000) -> str:
    """Truncate prompt to fit token limit."""
    encoding = tiktoken.encoding_for_model("gpt-4o-mini")
    tokens = encoding.encode(prompt)
    if len(tokens) <= max_tokens:
        return prompt
    return encoding.decode(tokens[:max_tokens])

Example: Long context optimization

system_prompt = "You are a helpful assistant."  # ~5 tokens
user_context = "Here is the document: " + long_document  # could be 10,000+ tokens

Optimize by summarizing or chunking context
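
One way to apply this is to split the long document into token-bounded chunks and summarize or process each chunk separately. Here is a hedged sketch using the tiktoken helpers above (long_document and the 2,000-token budget are illustrative):

def chunk_by_tokens(text: str, chunk_tokens: int = 2000, model: str = "gpt-4o-mini") -> list[str]:
    """Split text into chunks of at most chunk_tokens tokens each."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i : i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]

chunks = chunk_by_tokens(long_document)
print(f"{count_tokens(long_document)} tokens split into {len(chunks)} chunks")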

Step 6: Connection Pooling

import httpx
from langchain_openai import ChatOpenAI

Configure connection pooling for high throughput

transport = httpx.HTTPTransport(
    retries=3,
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
)

Use shared client across requests

http_client = httpx.Client(transport=transport, timeout=30.0)

Note: the OpenAI SDK handles this internally, but for custom integrations:

llm = ChatOpenAI(
    model="gpt-4o-mini",
    http_client=http_client,  # reuse connections
)
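
For the async paths in Step 3 (abatch, astream), the same idea applies with an async client. This sketch assumes langchain-openai's http_async_client parameter; check that your installed version exposes it:

import httpx
from langchain_openai import ChatOpenAI

async_transport = httpx.AsyncHTTPTransport(
    retries=3,
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
)
async_http_client = httpx.AsyncClient(transport=async_transport, timeout=30.0)

llm_async = ChatOpenAI(
    model="gpt-4o-mini",
    http_async_client=async_http_client,  # reused across .abatch() / .astream() calls
)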

Step 7: Model Selection Optimization

Match model to task complexity

Fast + Cheap: Simple tasks

llm_fast = ChatOpenAI(model="gpt-4o-mini", temperature=0)

Powerful + Slower: Complex reasoning

llm_powerful = ChatOpenAI(model="gpt-4o", temperature=0)

Router pattern: Choose model based on task

from langchain_core.runnables import RunnableBranch

def classify_complexity(input_dict: dict) -> str:
    """Classify input complexity."""
    text = input_dict.get("input", "")
    # Simple heuristic: replace with a real classifier
    return "complex" if len(text) > 500 else "simple"

router = RunnableBranch(
    (lambda x: classify_complexity(x) == "simple", prompt | llm_fast),
    prompt | llm_powerful,  # default to the powerful model
)
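
A usage sketch for the router (the inputs are illustrative; the second exceeds the 500-character heuristic, so it falls through to the default branch):

short_query = {"input": "What is 2 + 2?"}
long_query = {"input": "Analyze the following contract clause in detail... " * 20}  # > 500 chars, classified "complex"

print(router.invoke(short_query).content)  # routed to gpt-4o-mini
print(router.invoke(long_query).content)   # falls through to gpt-4o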

Performance Metrics

Optimization | Latency Improvement | Cost Impact
Caching | 90-99% on cache hit | Major reduction
Batching | 50-80% for bulk | Neutral
Streaming | Perceived 80%+ | Neutral
Shorter prompts | 10-30% | Cost reduction
Connection pooling | 5-10% | Neutral
Model routing | 20-50% | Cost reduction
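
These optimizations stack. Here is a hedged end-to-end sketch combining persistent caching with async batching from the steps above (the inputs, model, and concurrency limit are illustrative):

import asyncio
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))  # repeat inputs become near-free

chain = ChatPromptTemplate.from_template("{input}") | ChatOpenAI(model="gpt-4o-mini")

async def run(inputs: list[str]):
    # parallel API calls, bounded by max_concurrency
    return await chain.abatch([{"input": i} for i in inputs], config={"max_concurrency": 20})

results = asyncio.run(run(["Translate 'hello' to French", "Translate 'hello' to Spanish"]))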

Output

  • Performance benchmarking setup

  • Caching implementation

  • Optimized batch processing

  • Streaming for perceived performance

Resources

  • LangChain Caching

  • OpenAI Latency Guide

  • tiktoken

Next Steps

Use langchain-cost-tuning to optimize API costs alongside performance.

Error Handling

Error | Cause | Resolution
Authentication failure | Invalid or expired credentials | Refresh tokens or re-authenticate
Configuration conflict | Incompatible settings detected | Review and resolve conflicting parameters
Resource not found | Referenced resource missing | Verify the resource exists and permissions are correct
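
Beyond the table above, transient failures such as rate limits or network errors are commonly handled by wrapping the runnable with retries. A minimal sketch using LangChain's with_retry (the retry count is illustrative):

from langchain_openai import ChatOpenAI

llm_with_retry = ChatOpenAI(model="gpt-4o-mini").with_retry(
    stop_after_attempt=3,          # give up after three attempts
    wait_exponential_jitter=True,  # back off with jitter between attempts
)
response = llm_with_retry.invoke("Hello!")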

Examples

Basic usage: Apply the LangChain performance tuning steps above to a standard project with the default configuration options.

Advanced scenario: Customize the tuning steps for production environments with multiple constraints and team-specific requirements.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. No summaries are provided by the upstream source; each entry is repository-sourced and flagged Needs Review.

  • tracking-crypto-prices (Web3)

  • aggregating-crypto-news (Web3)

  • tracking-crypto-derivatives (Web3)

  • tracking-crypto-portfolio (Web3)