Production Deployment
Complete production deployment guide for ElevenLabs API integration including rate limiting patterns, comprehensive error handling strategies, monitoring setup, and testing frameworks.
Overview
This skill provides battle-tested patterns for deploying an ElevenLabs API integration to production, covering:
- Rate Limiting: Concurrency-aware rate limiting that respects plan limits
- Error Handling: Comprehensive error recovery and retry strategies
- Monitoring: Real-time metrics, logging, and alerting
- Testing: Load testing, concurrency validation, and production readiness checks
Quick Start
1. Set Up Monitoring Infrastructure
bash scripts/setup-monitoring.sh --project-name "my-elevenlabs-app" \
  --log-level "info" \
  --metrics-port 9090
This script:
- Configures Winston logging with rotation
- Sets up Prometheus metrics endpoints
- Creates health check endpoints
- Initializes error tracking
2. Deploy Production Configuration
bash scripts/deploy-production.sh --environment "production" \
  --api-key "$ELEVENLABS_API_KEY" \
  --concurrency-limit 10 \
  --region "us-east-1"
This script:
- Validates environment variables
- Applies rate limiting configuration
- Configures error handling middleware
- Sets up monitoring integrations
- Performs smoke tests
3. Test Rate Limiting
bash scripts/test-rate-limiting.sh --concurrency 20 \
  --duration 60 \
  --plan-tier "pro"
This script:
- Simulates concurrent requests
- Validates queue behavior
- Measures latency under load
- Generates a performance report
ElevenLabs Concurrency Limits
Limits by Plan Tier
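| Plan | Multilingual v2 | Turbo/Flash | STT | Music |
|---|---|---|---|---|
| Free | 2 | 4 | 8 | N/A |
| Starter | 3 | 6 | 12 | 2 |
| Creator | 5 | 10 | 20 | 2 |
| Pro | 10 | 20 | 40 | 2 |
| Scale | 15 | 30 | 60 | 3 |
| Business | 15 | 30 | 60 | 3 |
| Enterprise | Elevated | Elevated | Elevated | Highest |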
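For configuration code, the table can be encoded as a lookup. The sketch below is illustrative only: the dictionary keys (`multilingual_v2`, `turbo_flash`, `stt`, `music`) and the function name are hypothetical, and the numbers are copied from the table above rather than fetched from the API. Enterprise is left out because its limits are not fixed values.

```python
# Concurrency limits copied from the plan table above.
# None marks a model family that is not available on that plan.
PLAN_LIMITS = {
    "free":     {"multilingual_v2": 2,  "turbo_flash": 4,  "stt": 8,  "music": None},
    "starter":  {"multilingual_v2": 3,  "turbo_flash": 6,  "stt": 12, "music": 2},
    "creator":  {"multilingual_v2": 5,  "turbo_flash": 10, "stt": 20, "music": 2},
    "pro":      {"multilingual_v2": 10, "turbo_flash": 20, "stt": 40, "music": 2},
    "scale":    {"multilingual_v2": 15, "turbo_flash": 30, "stt": 60, "music": 3},
    "business": {"multilingual_v2": 15, "turbo_flash": 30, "stt": 60, "music": 3},
}


def concurrency_limit(plan: str, model_family: str) -> int:
    """Return the concurrency limit for a plan and model family."""
    try:
        limit = PLAN_LIMITS[plan.lower()][model_family]
    except KeyError as exc:
        raise ValueError(
            f"unknown plan or model family: {plan!r}/{model_family!r}"
        ) from exc
    if limit is None:
        raise ValueError(f"{model_family} is not available on the {plan} plan")
    return limit
```

Keeping the limits in one place makes it easier to update them when a plan changes, instead of hard-coding `--concurrency-limit` values in several scripts.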
Queue Management
When concurrency limits are exceeded:
- Requests are queued alongside lower-priority requests
- Typical latency increase: ~50ms
- Response headers include: current-concurrent-requests, maximum-concurrent-requests
Real-World Capacity
A concurrency limit of 5 can typically support ~100 simultaneous audio broadcasts depending on:
- Audio generation speed
- User behavior patterns
- Request distribution
Rate Limiting Patterns
1. Token Bucket Algorithm
Best for: Variable rate limiting with burst capacity
// See templates/rate-limiter.js.template for full implementation
const limiter = new TokenBucketRateLimiter({
  capacity: 10,   // Max concurrent requests
  refillRate: 2,  // Tokens per second
  queueSize: 100  // Max queued requests
});
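To make the mechanics concrete, here is a minimal Python sketch of a token bucket. It is not the implementation in templates/rate-limiter.js.template; the class name, the `try_acquire` method, and the injectable `clock` parameter are assumptions made for illustration, and the sketch omits the request queue. Note that a plain token bucket bounds the request rate and burst size; it does not by itself cap the number of in-flight requests.

```python
import time


class TokenBucket:
    """Minimal token bucket: `capacity` bounds bursts, `refill_rate` is tokens/second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock              # injectable so tests can control time
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._last = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False without blocking otherwise."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller that gets `False` would typically enqueue the request or wait before trying again, which is where the `queueSize` setting above would come in.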
2. Sliding Window with Priority Queue
Best for: Enforcing strict concurrency limits with prioritization
# See templates/rate-limiter.py.template for full implementation
limiter = SlidingWindowRateLimiter(
    max_concurrent=10,
    window_size=60,
    priority_levels=3,
)
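The strict-concurrency and prioritization parts of this pattern can be sketched with asyncio as follows. This is an illustrative simplification, not the template's code: `PriorityConcurrencyLimiter` and its methods are hypothetical names, lower numbers mean higher priority by assumption, and the 60-second window from the snippet above is not modeled.

```python
import asyncio
import heapq
import itertools


class PriorityConcurrencyLimiter:
    """Caps in-flight requests; waiting callers are served by priority, then FIFO."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self._active = 0
        self._waiters = []               # heap of (priority, sequence, future)
        self._seq = itertools.count()    # tie-breaker keeps equal priorities FIFO

    async def acquire(self, priority: int = 1) -> None:
        # Fast path: a slot is free and nobody is already waiting.
        if self._active < self.max_concurrent and not self._waiters:
            self._active += 1
            return
        fut = asyncio.get_running_loop().create_future()
        heapq.heappush(self._waiters, (priority, next(self._seq), fut))
        try:
            await fut                    # release() hands a slot to this waiter
        except asyncio.CancelledError:
            if fut.done() and not fut.cancelled():
                self.release()           # slot was granted just before cancellation
            raise

    def release(self) -> None:
        # Pass the slot directly to the best waiter, skipping cancelled ones.
        while self._waiters:
            _, _, fut = heapq.heappop(self._waiters)
            if not fut.cancelled():
                fut.set_result(None)
                return
        self._active -= 1
```

Handing the slot directly to the next waiter, rather than decrementing and letting callers race, is what keeps the ordering strict.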
3. Adaptive Rate Limiting
Best for: Self-adjusting to API response headers
Monitors current-concurrent-requests and maximum-concurrent-requests headers to dynamically adjust rate limits.
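One way this adjustment might look is sketched below. The header names come from this guide; the function names and the `headroom` parameter are hypothetical, and the policy (use the server-reported maximum minus a safety margin) is only one reasonable choice.

```python
def parse_concurrency_headers(headers):
    """Return (current, maximum) from response headers, or None if absent or invalid."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        current = int(lowered["current-concurrent-requests"])
        maximum = int(lowered["maximum-concurrent-requests"])
    except (KeyError, ValueError):
        return None
    return current, maximum


def adjusted_limit(local_limit, headers, headroom=1):
    """Derive a new local concurrency limit from the latest response headers.

    Falls back to the existing limit when the headers are missing, and never
    returns less than 1 so that traffic cannot stall completely.
    """
    parsed = parse_concurrency_headers(headers)
    if parsed is None:
        return local_limit
    _, maximum = parsed
    return max(1, maximum - headroom)
```

For example, `adjusted_limit(10, response_headers)` could be called after each response and the result fed back into whichever limiter is in use.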
Error Handling Strategies
Error Categories
1. Rate Limit Errors (429)
- Implement exponential backoff
- Queue requests for retry
- Monitor queue depth
2. Service Errors (500-599)
- Retry with exponential backoff
- Circuit breaker pattern
- Fallback to cached audio
3. Client Errors (400-499, excluding 429)
- Log for debugging
- Do not retry
- Return a meaningful error to the user
4. Network Errors
- Retry with linear backoff
- Timeout after 30 seconds
- Circuit breaker after 5 failures
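The categories above can be turned into a small retry policy. The sketch below is an assumption-laden illustration: the function names, the `base` and `cap` values, and the use of full jitter for the exponential case are choices made here, not settings taken from the templates.

```python
import random


def classify(status=None):
    """Map an HTTP status code (None means a network failure) to an error category."""
    if status is None:
        return "network"
    if status == 429:
        return "rate_limit"
    if 500 <= status <= 599:
        return "service"
    if 400 <= status <= 499:
        return "client"
    return "ok"


def retry_delay(category, attempt, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based), or None for no retry."""
    if category in ("rate_limit", "service"):
        # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^(n-1))).
        ceiling = min(cap, base * 2 ** (attempt - 1))
        return rng() * ceiling
    if category == "network":
        # Linear backoff, capped.
        return min(cap, base * attempt)
    return None  # client errors and successes are never retried
```

A caller would loop until `retry_delay` returns `None` or a maximum attempt count is reached, sleeping for the returned number of seconds between attempts.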
Circuit Breaker Pattern
// Automatically opens circuit after threshold failures
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 60000,
  monitorInterval: 5000
});
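For readers unfamiliar with the pattern, the following Python sketch shows the closed, open, and half-open states. It is a simplified stand-in, not the code in the error-handler templates: the method names and the injectable `clock` are assumptions, the timeout is in seconds rather than milliseconds, and there is no equivalent of `monitorInterval`.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` failures; allows a probe after `reset_timeout` s."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"       # timeout elapsed: let a probe request through
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()   # probe failed: reopen immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each API call and report the outcome with `record_success()` or `record_failure()`; when the circuit is open they can serve a fallback such as cached audio.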
Monitoring Setup
Key Metrics to Track
Request Metrics:
- elevenlabs_requests_total: Total requests by status
- elevenlabs_requests_duration_seconds: Request latency histogram
- elevenlabs_concurrent_requests: Current concurrent requests
- elevenlabs_queue_depth: Queued requests waiting
Error Metrics:
- elevenlabs_errors_total: Total errors by type
- elevenlabs_retries_total: Total retry attempts
- elevenlabs_circuit_breaker_state: Circuit breaker state
Business Metrics:
- elevenlabs_characters_generated: Total characters processed
- elevenlabs_audio_duration_seconds: Total audio duration
- elevenlabs_quota_used_percentage: Quota utilization
Logging Best Practices
Structure logs with:
- Request ID for tracing
- User ID for analysis
- Timestamp in ISO 8601
- Error stack traces
- Performance metrics
Log Levels:
- error: Failures requiring attention
- warn: Degraded performance, retries
- info: Request completion, key events
- debug: Detailed execution flow
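The monitoring script configures Winston, which is a Node.js library. For Python services, a comparable structured log line can be produced with the standard `logging` module, as in the sketch below. The field names (`request_id`, `user_id`, `duration_ms`, `stack`) are assumptions chosen to match the list above, not a schema defined by the templates.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the fields listed above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Optional context passed via logger.info(..., extra={...})
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "elevenlabs") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage might look like `build_logger().info("tts request completed", extra={"request_id": "req-1", "duration_ms": 412})`.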
Alerting Rules
Critical Alerts:
- Error rate > 5% over 5 minutes
- Circuit breaker open for > 1 minute
- Queue depth > 500 requests
Warning Alerts:
- Latency p95 > 2 seconds
- Quota usage > 90%
- Retry rate > 20%
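In production these rules would normally live in the alerting system itself. As a quick way to check the thresholds in tests or scripts, they can also be expressed as a plain function. The input keys below are hypothetical names, and the caller is assumed to have already aggregated each value over the right window (for example, the error rate over 5 minutes).

```python
def evaluate_alerts(metrics: dict) -> dict:
    """Return the names of the critical and warning alerts that should fire."""
    critical_rules = {
        "high_error_rate": metrics["error_rate_5m"] > 0.05,
        "circuit_breaker_stuck_open": metrics["circuit_open_seconds"] > 60,
        "queue_backlog": metrics["queue_depth"] > 500,
    }
    warning_rules = {
        "high_latency": metrics["latency_p95_seconds"] > 2.0,
        "quota_nearly_exhausted": metrics["quota_used_fraction"] > 0.90,
        "high_retry_rate": metrics["retry_rate"] > 0.20,
    }
    return {
        "critical": [name for name, firing in critical_rules.items() if firing],
        "warning": [name for name, firing in warning_rules.items() if firing],
    }
```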
Testing Frameworks
Load Testing
Simulate production traffic patterns:
# Gradual ramp-up test
bash scripts/test-rate-limiting.sh \
  --pattern "ramp-up" \
  --start-rps 1 \
  --end-rps 10 \
  --duration 300
Concurrency Validation
Verify concurrency limits are enforced:
# Burst test
bash scripts/test-rate-limiting.sh \
  --pattern "burst" \
  --concurrency 50 \
  --iterations 100
Chaos Testing
Test error handling under adverse conditions:
# Simulate API failures
bash scripts/test-rate-limiting.sh \
  --pattern "chaos" \
  --failure-rate 0.1 \
  --duration 120
Production Checklist
Pre-Deployment
- Environment variables configured
- Rate limiting configured for plan tier
- Error handling middleware implemented
- Monitoring and logging configured
- Health check endpoints created
- Load testing completed
- Chaos testing completed
Post-Deployment
- Smoke tests passed
- Metrics dashboard configured
- Alerts configured and tested
- On-call rotation established
- Runbooks documented
- Backup/fallback strategy tested
Scripts
setup-monitoring.sh
Configures comprehensive monitoring infrastructure:
- Winston logging with daily rotation
- Prometheus metrics exporter
- Health check endpoints
- Error tracking integration
- Custom metric collectors
Usage:
bash scripts/setup-monitoring.sh \
  --project-name "my-app" \
  --log-level "info" \
  --metrics-port 9090 \
  --health-port 8080
deploy-production.sh
Production deployment orchestration:
- Environment validation
- Dependency installation
- Configuration deployment
- Service health checks
- Smoke test execution
- Rollback on failure
Usage:
bash scripts/deploy-production.sh \
  --environment "production" \
  --api-key "$ELEVENLABS_API_KEY" \
  --concurrency-limit 10 \
  --skip-tests false
test-rate-limiting.sh
Comprehensive rate limiting test suite:
- Concurrency limit validation
- Queue behavior testing
- Latency measurement
- Error rate tracking
- Performance reporting
Usage:
bash scripts/test-rate-limiting.sh \
  --concurrency 20 \
  --duration 60 \
  --plan-tier "pro" \
  --pattern "ramp-up"
validate-config.sh
Production configuration validator:
- Environment variable checks
- API key validation
- Rate limit configuration
- Monitoring setup verification
- Security audit
Usage:
bash scripts/validate-config.sh \
  --config-file "config/production.json" \
  --strict true
rollback.sh
Automated rollback script:
- Reverts to previous deployment
- Restores configuration
- Validates health checks
- Notifies team
Usage:
bash scripts/rollback.sh \
  --deployment-id "deploy-123" \
  --reason "High error rate"
Templates
rate-limiter.js.template
Token bucket rate limiter with priority queue:
- Configurable capacity and refill rate
- Priority-based request queuing
- Automatic backpressure handling
- Prometheus metrics integration
rate-limiter.py.template
Sliding window rate limiter with async support:
- Strict concurrency enforcement
- Redis-backed for distributed systems
- Circuit breaker integration
- Comprehensive error handling
error-handler.js.template
Production-grade error handler:
- Error categorization and routing
- Exponential backoff retry logic
- Circuit breaker pattern
- Structured error logging
error-handler.py.template
Async error handler with context:
- Context-aware error handling
- Retry with jitter
- Error aggregation and reporting
- Integration with monitoring
monitoring-config.json.template
Complete monitoring configuration:
- Prometheus scrape configs
- Alert rules and thresholds
- Log aggregation settings
- Dashboard definitions
health-check.js.template
Comprehensive health check endpoint:
- API connectivity verification
- Rate limiter health
- Queue depth monitoring
- Dependency checks
Examples
Rate Limiting Example
Complete implementation showing:
- Token bucket rate limiter
- Priority queue management
- Backpressure handling
- Metrics collection
Location: examples/rate-limiting/
Error Handling Example
Production error handling patterns:
- Retry with exponential backoff
- Circuit breaker implementation
- Fallback strategies
- Error logging and alerting
Location: examples/error-handling/
Monitoring Example
Full monitoring stack setup:
- Prometheus metrics
- Grafana dashboards
- Winston logging
- Alert manager configuration
Location: examples/monitoring/
Best Practices
Rate Limiting
- Configure for your plan tier - Don't exceed concurrency limits
- Implement graceful degradation - Queue requests, don't drop them
- Monitor queue depth - Alert on excessive queueing
- Use adaptive limiting - Adjust based on response headers
- Test under load - Validate behavior before production
Error Handling
- Categorize errors - Different strategies for different error types
- Implement retries carefully - Exponential backoff with jitter
- Use circuit breakers - Prevent cascading failures
- Log comprehensively - Include context for debugging
- Provide fallbacks - Cached audio, degraded experience
Monitoring
- Track key metrics - Request rate, latency, errors, concurrency
- Set meaningful alerts - Actionable, not noisy
- Use structured logging - JSON format for easy parsing
- Create dashboards - Real-time visibility
- Test alerts - Verify notification channels work
Testing
- Load test gradually - Ramp up to avoid overwhelming the API
- Simulate realistic patterns - User behavior, not raw requests
- Test error scenarios - Chaos engineering
- Validate concurrency - Ensure limits are enforced
- Monitor during tests - Use the production monitoring stack
Troubleshooting
High Error Rate
Symptoms: Error rate > 5%
Diagnosis:
- Check Prometheus metrics: rate(elevenlabs_errors_total[5m])
- Review error logs for patterns
- Verify API key is valid
- Check quota remaining
Resolution:
- If rate limiting: Reduce request rate or upgrade plan
- If service errors: Implement circuit breaker, contact support
- If client errors: Fix request validation
High Latency
Symptoms: p95 latency > 2 seconds
Diagnosis:
- Check concurrency: elevenlabs_concurrent_requests
- Check queue depth: elevenlabs_queue_depth
- Review response headers: current-concurrent-requests
Resolution:
- Increase concurrency limit (upgrade plan if needed)
- Optimize request payload size
- Implement request coalescing
- Use Turbo/Flash models for lower latency
Circuit Breaker Open
Symptoms: Requests failing immediately
Diagnosis:
- Check circuit breaker state metric
- Review error logs for failure pattern
- Check ElevenLabs status page
Resolution:
- Wait for automatic reset (default 60s)
- If persistent: Check API connectivity
- Manual reset if resolved: Restart service
Resources
- ElevenLabs API Docs
- Concurrency Documentation
- API Reference
- Status Page
Contributing
When updating this skill:
- Test scripts thoroughly in a production-like environment
- Update templates with the latest best practices
- Add examples for new patterns
- Update the troubleshooting guide
- Validate with validate-skill.sh
Version: 1.0.0
Last Updated: 2025-10-29
Maintainer: ElevenLabs Plugin Team