mistral-performance-tuning

Mistral AI Performance Tuning

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "mistral-performance-tuning" with this command: npx skills add jeremylongshore/claude-code-plugins-plus-skills/jeremylongshore-claude-code-plugins-plus-skills-mistral-performance-tuning

Mistral AI Performance Tuning

Overview

Optimize Mistral AI API response times and throughput for production integrations. Key performance factors include model selection (mistral-small: ~200-500ms, mistral-large: ~500-2000ms), prompt length (directly affects time-to-first-token), streaming vs non-streaming (streaming gives perceived speed), and concurrent request management against per-key rate limits.

Prerequisites

  • Mistral API integration in production

  • Understanding of per-endpoint rate limits (RPM and TPM)

  • Application architecture that supports streaming responses

Instructions

Step 1: Choose the Right Model for Latency Requirements

// Model selection by latency budget const MODEL_BY_LATENCY: Record<string, { model: string; typicalMs: string }> = { 'realtime_chat': { model: 'mistral-small-latest', typicalMs: '200-500ms' }, # HTTP 200 OK 'code_completion': { model: 'codestral-latest', typicalMs: '150-400ms' }, 'background_analysis': { model: 'mistral-large-latest', typicalMs: '500-2000ms' }, # HTTP 500 Internal Server Error 'embeddings': { model: 'mistral-embed', typicalMs: '50-150ms' }, };

function selectModel(useCase: string): string { return MODEL_BY_LATENCY[useCase]?.model || 'mistral-small-latest'; }

Step 2: Enable Streaming for User-Facing Responses

import Mistral from '@mistralai/mistralai';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// Streaming delivers first tokens in ~200ms vs waiting 1-2s for full response async function* streamChat(model: string, messages: any[]) { const stream = await client.chat.stream({ model, messages }); for await (const chunk of stream) { const content = chunk.data.choices[0]?.delta?.content; if (content) yield content; } } // TTFT (time to first token) drops from 500-2000ms to ~200ms with streaming # HTTP 500 Internal Server Error

Step 3: Cache Identical Requests

import { createHash } from 'crypto'; import { LRUCache } from 'lru-cache';

const responseCache = new LRUCache<string, any>({ max: 5000, ttl: 3600_000 }); # 5000: 5 seconds in ms

async function cachedCompletion(model: string, messages: any[], temperature: number = 0) { // Only cache deterministic requests (temperature=0) if (temperature > 0) return client.chat.complete({ model, messages, temperature });

const key = createHash('md5').update(JSON.stringify({ model, messages })).digest('hex'); const cached = responseCache.get(key); if (cached) return cached;

const result = await client.chat.complete({ model, messages, temperature: 0 }); responseCache.set(key, result); return result; }

Step 4: Optimize Prompt Length

// Reduce input tokens to decrease TTFT and total latency const OPTIMIZATION = { // Keep system prompts concise systemPrompt: 'You are a helpful assistant. Be brief.', // ~10 tokens, not 200 # HTTP 200 OK

// Limit context window usage maxContextTokens: 4000, // Don't fill 32K context when 4K suffices # 4000: dev server port

// Trim conversation history maxHistoryTurns: 5, // Keep last 5 turns, not entire conversation };

function trimMessages(messages: any[], maxTurns: number = 5): any[] { const system = messages.filter(m => m.role === 'system'); const history = messages.filter(m => m.role !== 'system').slice(-maxTurns * 2); return [...system, ...history]; }

Step 5: Manage Concurrent Requests

import PQueue from 'p-queue';

// Respect RPM limits while maximizing throughput const requestQueue = new PQueue({ concurrency: 10, // Max parallel requests interval: 60_000, // Per minute intervalCap: 100, // RPM limit for your key });

async function queuedCompletion(model: string, messages: any[]) { return requestQueue.add(() => client.chat.complete({ model, messages })); }

Error Handling

Issue Cause Solution

429 rate_limit_exceeded

RPM or TPM cap hit Use PQueue with RPM limit, add exponential backoff

High TTFT (>1s) Prompt too long or model too large Trim prompt, use mistral-small for latency-sensitive

Streaming connection dropped Network timeout Implement reconnection with resume from last chunk

Cache ineffective High temperature (non-deterministic) Only cache temperature=0 requests

Examples

Basic usage: Apply mistral performance tuning to a standard project setup with default configuration options.

Advanced scenario: Customize mistral performance tuning for production environments with multiple constraints and team-specific requirements.

Output

  • Configuration files or code changes applied to the project

  • Validation report confirming correct implementation

  • Summary of changes made and their rationale

Resources

  • Official ORM documentation

  • Community best practices and patterns

  • Related skills in this plugin pack

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

backtesting-trading-strategies

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

svg-icon-generator

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

performance-lighthouse-runner

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

mindmap-generator

No summary provided by upstream source.

Repository SourceNeeds Review