groq-performance-tuning

Groq Performance Tuning

Overview

Get the most out of Groq's low-latency LPU inference. Time to first token (TTFT) is roughly 80-200ms depending on the model (see Step 1), so tuning focuses on streaming efficiency, response caching for deterministic prompts, model selection for speed versus quality, and parallel request orchestration.

Prerequisites

Groq API key, and awareness of the rate limits (RPM/TPM) on your account
groq-sdk npm package installed (Step 3 also uses lru-cache)
Understanding of LLM token economics
Monitoring for TTFT (time to first token)

Instructions

Step 1: Select Optimal Model for Speed

import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

// Model speed tiers (approximate TTFT):
//   llama-3.3-70b-versatile: ~200ms TTFT, best quality
//   llama-3.1-8b-instant:    ~80ms TTFT, fastest
//   mixtral-8x7b-32768:      ~150ms TTFT, long context

async function fastCompletion(prompt: string) {
  return groq.chat.completions.create({
    model: 'llama-3.1-8b-instant', // Fastest model
    messages: [{ role: 'user', content: prompt }],
    temperature: 0, // Near-deterministic output, so responses are cacheable
    max_tokens: 256, // Cap output length (in tokens, not bytes) for speed
  });
}
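The speed tiers above can also be turned into a small lookup, so callers state a latency budget instead of hard-coding a model name. The following is a minimal sketch, not part of the Groq SDK: `ModelTier`, `MODEL_TIERS`, and `pickModel` are hypothetical names, the TTFT figures are the approximate values from the comment above, and the quality ordering (in particular Mixtral ranked above the 8b model) is an assumption you should replace with your own evaluation.

```typescript
// Hypothetical helper: choose a model from a TTFT budget.
// TTFT values are the approximate figures quoted above; measure your own.
interface ModelTier {
  model: string;
  approxTtftMs: number;
  qualityRank: number; // higher = better quality (assumed ordering)
}

const MODEL_TIERS: ModelTier[] = [
  { model: 'llama-3.3-70b-versatile', approxTtftMs: 200, qualityRank: 3 },
  { model: 'mixtral-8x7b-32768', approxTtftMs: 150, qualityRank: 2 },
  { model: 'llama-3.1-8b-instant', approxTtftMs: 80, qualityRank: 1 },
];

// Return the highest-quality model whose approximate TTFT fits the budget.
// If nothing fits, fall back to the fastest model in the table.
function pickModel(ttftBudgetMs: number): string {
  const fitting = MODEL_TIERS
    .filter(tier => tier.approxTtftMs <= ttftBudgetMs)
    .sort((a, b) => b.qualityRank - a.qualityRank);
  if (fitting.length > 0) return fitting[0].model;

  const fastest = [...MODEL_TIERS].sort(
    (a, b) => a.approxTtftMs - b.approxTtftMs
  )[0];
  return fastest.model;
}
```

The returned name can be passed as the `model` field of `groq.chat.completions.create`. Model availability changes over time, so check the table against the current model list before relying on it.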

Step 2: Streaming for Perceived Performance

async function streamCompletion(
  messages: any[],
  onToken: (token: string) => void
) {
  const stream = await groq.chat.completions.create({
    model: 'llama-3.3-70b-versatile',
    messages,
    stream: true,
    max_tokens: 1024, // Upper bound on generated tokens
  });

  // Forward each token as it arrives and keep the full text for the caller
  let fullResponse = '';
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    fullResponse += token;
    onToken(token);
  }
  return fullResponse;
}
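Because the prerequisites call for TTFT monitoring, it can help to time the stream itself. The sketch below works on any `AsyncIterable<string>` of tokens rather than on the SDK stream directly, so it makes no assumptions about Groq's chunk types. `StreamTiming` and `timeStream` are hypothetical names, and the injectable `now` clock is an illustrative choice that keeps the helper easy to test.

```typescript
// Hypothetical helper: measure time to first token and total time for a
// token stream. Works with any async iterable of strings.
interface StreamTiming {
  text: string;
  ttftMs: number | null; // null when the stream produced no non-empty token
  totalMs: number;
}

async function timeStream(
  tokens: AsyncIterable<string>,
  now: () => number = Date.now
): Promise<StreamTiming> {
  const start = now();
  let ttftMs: number | null = null;
  let text = '';

  for await (const token of tokens) {
    // Record the clock only once, on the first token that carries content
    if (ttftMs === null && token.length > 0) {
      ttftMs = now() - start;
    }
    text += token;
  }
  return { text, ttftMs, totalMs: now() - start };
}
```

To use it with the streaming call above, wrap the SDK stream in a small async generator that yields `chunk.choices[0]?.delta?.content || ''` for each chunk. Note that the clock here starts when iteration begins; to include request setup time, start timing before calling `create` instead.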

Step 3: Exact-Match Prompt Cache

import { LRUCache } from 'lru-cache';
import { createHash } from 'crypto';

const promptCache = new LRUCache<string, string>({
  max: 500, // Maximum number of cached responses
  ttl: 1000 * 60 * 10, // 10 minutes (in milliseconds) for deterministic prompts
});

function hashPrompt(messages: any[], model: string): string {
  return createHash('sha256')
    .update(JSON.stringify({ messages, model }))
    .digest('hex');
}

async function cachedCompletion(messages: any[], model: string) {
  const key = hashPrompt(messages, model);
  const cached = promptCache.get(key);
  // Compare against undefined so a cached empty string still counts as a hit
  if (cached !== undefined) return cached;

  const response = await groq.chat.completions.create({
    model,
    messages,
    temperature: 0,
  });

  // content can be null (for example on tool calls); store an empty string then
  const result = response.choices[0].message.content ?? '';
  promptCache.set(key, result);
  return result;
}
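The cache above is only populated after a response arrives, so identical prompts issued at the same time will each miss it and each make a request. One possible complement, sketched here under that assumption, is to share the pending promise between concurrent callers. `inFlight` and `dedupe` are hypothetical names, not part of `lru-cache` or the Groq SDK.

```typescript
// Hypothetical helper: coalesce concurrent calls that share a cache key.
const inFlight = new Map<string, Promise<string>>();

function dedupe(key: string, run: () => Promise<string>): Promise<string> {
  // If an identical request is already pending, reuse its promise
  const pending = inFlight.get(key);
  if (pending) return pending;

  // Otherwise start the request and forget it once it settles,
  // whether it resolved or rejected
  const promise = run().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

A caller could wrap the existing function as `dedupe(hashPrompt(messages, model), () => cachedCompletion(messages, model))`. A rejected request is shared by every waiting caller, but it is not remembered afterwards, so the next call retries.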

Step 4: Parallel Request Orchestration

async function parallelCompletions(
  prompts: string[],
  concurrency = 5
) {
  const results: string[] = [];

  // Process prompts in fixed-size batches; each batch waits for its slowest
  // request, and one rejected request rejects the whole call
  for (let i = 0; i < prompts.length; i += concurrency) {
    const batch = prompts.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map(prompt =>
        cachedCompletion(
          [{ role: 'user', content: prompt }],
          'llama-3.1-8b-instant'
        )
      )
    );
    results.push(...batchResults);
  }
  return results;
}
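The batch loop above waits for the slowest request in each batch before starting the next one. An alternative, sketched here as one option rather than a required change, keeps a fixed number of requests in flight and starts a new one as soon as any finishes. `mapWithConcurrency` is a hypothetical helper name; it is generic and does not depend on the Groq SDK.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

With the functions above, a call might look like `mapWithConcurrency(prompts, 5, p => cachedCompletion([{ role: 'user', content: p }], 'llama-3.1-8b-instant'))`. As with `Promise.all`, the first rejection rejects the whole call; wrap the worker in a try/catch if partial results are acceptable.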

Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Rate limit (HTTP 429) | Over RPM/TPM quota | Use exponential backoff, batch requests |
| High TTFT | Using 70b model | Switch to 8b-instant for latency-sensitive tasks |
| Stream disconnect | Network timeout | Implement reconnection with partial response recovery |
| Token overflow | max_tokens too high | Set conservative limits, truncate prompts |
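The exponential backoff suggested for rate limits can be sketched as a generic wrapper. This is a minimal illustration under stated assumptions: `withBackoff`, `RetryOptions`, and `isRateLimit` are hypothetical names, the defaults are arbitrary, and the check assumes the thrown error exposes a numeric `status` property equal to 429. The SDK may already retry some failures itself, so check its client options before layering this on top.

```typescript
// Hypothetical helper: retry a call on HTTP 429 with exponential backoff
// and full jitter. Other errors are rethrown immediately.
interface RetryOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for testing
}

// Assumption: rate-limit errors carry `status === 429`
function isRateLimit(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

async function withBackoff<T>(
  fn: () => Promise<T>,
  opts: RetryOptions = {}
): Promise<T> {
  const {
    maxRetries = 5,
    baseDelayMs = 500,
    maxDelayMs = 30_000,
    sleep = (ms: number) =>
      new Promise<void>(resolve => setTimeout(resolve, ms)),
  } = opts;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRateLimit(err) || attempt >= maxRetries) throw err;
      // Delay ceiling doubles each attempt, capped at maxDelayMs;
      // the actual wait is a random value below that ceiling
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleep(Math.random() * ceiling);
    }
  }
}
```

A call such as `withBackoff(() => cachedCompletion(messages, model))` would then retry only on rate-limit responses. The random jitter spreads retries out so parallel workers do not all retry at the same moment.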

Examples

Latency Benchmark

async function benchmarkModels(prompt: string) {
  const models = ['llama-3.1-8b-instant', 'llama-3.3-70b-versatile'];

  for (const model of models) {
    const start = performance.now();
    await groq.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 100,
    });
    // Total request latency (not TTFT), because the call is not streamed
    console.log(`${model}: ${(performance.now() - start).toFixed(0)}ms`);
  }
}
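A single timed request per model is noisy, so it is usually worth collecting several samples and reporting a median and a tail value. The helper below is a small sketch using the nearest-rank method; `percentile` is a hypothetical name, and the sample values in the usage note are made up for illustration.

```typescript
// Hypothetical helper: nearest-rank percentile over latency samples (ms).
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  // Sort a copy so the caller's array is left untouched
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

For example, with illustrative samples `[30, 10, 40, 20]`, `percentile(samples, 50)` selects 20 and `percentile(samples, 95)` selects 40. Collect the samples by pushing each measured duration into an array inside the benchmark loop.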

Resources

Groq API Documentation
Groq Rate Limits
Groq Model Cards

Output

Configuration files or code changes applied to the project
Validation report confirming correct implementation
Summary of changes made and their rationale
