Query Optimizer
Optimize SQL queries for better performance through indexing, rewriting, and analysis.
Quick Start
Use EXPLAIN to analyze queries, add indexes on WHERE/JOIN columns, avoid SELECT *, limit results.
Instructions
Query Analysis with EXPLAIN
Basic EXPLAIN:
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
EXPLAIN ANALYZE (actual execution):
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Key metrics to check:
-
Seq Scan (bad) vs Index Scan (good)
-
Rows: Estimated vs actual
-
Cost: Lower is better
-
Execution time
Common Performance Issues
- Missing Indexes
Problem:
-- Seq Scan on users (cost=0.00..1234.56) SELECT * FROM users WHERE email = 'user@example.com';
Solution:
CREATE INDEX idx_users_email ON users(email); -- Now: Index Scan using idx_users_email
**2. SELECT ***
Problem:
SELECT * FROM posts; -- Fetches all columns
Solution:
SELECT id, title, created_at FROM posts; -- Only needed columns
- N+1 Queries
Problem:
-- Fetches posts SELECT * FROM posts; -- Then for each post: SELECT * FROM users WHERE id = ?;
Solution:
-- Single query with JOIN SELECT posts.*, users.name FROM posts JOIN users ON posts.user_id = users.id;
- No LIMIT
Problem:
SELECT * FROM posts ORDER BY created_at DESC; -- Returns all rows
Solution:
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20;
Indexing Strategies
Single column index:
CREATE INDEX idx_users_email ON users(email);
Composite index (order matters):
-- For: WHERE user_id = ? AND created_at > ? CREATE INDEX idx_posts_user_created ON posts(user_id, created_at);
Covering index (includes all needed columns):
-- For: SELECT id, title FROM posts WHERE user_id = ? CREATE INDEX idx_posts_user_id_title ON posts(user_id) INCLUDE (title);
Partial index (filtered):
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;
Index on expressions:
CREATE INDEX idx_users_lower_email ON users(LOWER(email)); -- For: WHERE LOWER(email) = 'user@example.com'
Query Rewriting
Use EXISTS instead of IN for large sets:
-- Slow SELECT * FROM users WHERE id IN (SELECT user_id FROM posts);
-- Faster SELECT * FROM users u WHERE EXISTS ( SELECT 1 FROM posts p WHERE p.user_id = u.id );
Use JOIN instead of subquery:
-- Slow SELECT * FROM posts WHERE user_id IN ( SELECT id FROM users WHERE is_active = true );
-- Faster SELECT p.* FROM posts p JOIN users u ON p.user_id = u.id WHERE u.is_active = true;
Avoid functions on indexed columns:
-- Bad: Can't use index SELECT * FROM users WHERE YEAR(created_at) = 2024;
-- Good: Can use index SELECT * FROM users WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
Use UNION ALL instead of UNION:
-- Slow: Removes duplicates SELECT id FROM posts UNION SELECT id FROM drafts;
-- Fast: No duplicate removal SELECT id FROM posts UNION ALL SELECT id FROM drafts;
JOIN Optimization
Order matters - filter early:
-- Bad: Large intermediate result SELECT * FROM posts p JOIN users u ON p.user_id = u.id WHERE p.created_at > '2024-01-01';
-- Good: Filter first SELECT * FROM posts p WHERE p.created_at > '2024-01-01' JOIN users u ON p.user_id = u.id;
Use appropriate JOIN type:
-- INNER JOIN: Only matching rows SELECT * FROM posts p INNER JOIN users u ON p.user_id = u.id;
-- LEFT JOIN: All posts, even without user SELECT * FROM posts p LEFT JOIN users u ON p.user_id = u.id;
Index JOIN columns:
CREATE INDEX idx_posts_user_id ON posts(user_id); CREATE INDEX idx_users_id ON users(id); -- Usually PK already indexed
Pagination Optimization
Offset pagination (slow for large offsets):
-- Slow for page 1000 SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 20000;
Cursor pagination (faster):
-- First page SELECT * FROM posts ORDER BY created_at DESC, id DESC LIMIT 20;
-- Next page (using last created_at and id) SELECT * FROM posts WHERE (created_at, id) < ('2024-01-01 12:00:00', 12345) ORDER BY created_at DESC, id DESC LIMIT 20;
Aggregation Optimization
Use indexes for GROUP BY:
CREATE INDEX idx_posts_user_id ON posts(user_id);
SELECT user_id, COUNT(*) FROM posts GROUP BY user_id;
Filter before aggregating:
-- Good SELECT user_id, COUNT(*) FROM posts WHERE created_at > '2024-01-01' GROUP BY user_id;
Use HAVING for aggregate filters:
SELECT user_id, COUNT() as post_count FROM posts GROUP BY user_id HAVING COUNT() > 10;
Subquery Optimization
Correlated subqueries (slow):
-- Bad: Runs subquery for each row SELECT * FROM users u WHERE (SELECT COUNT(*) FROM posts WHERE user_id = u.id) > 10;
JOIN instead:
-- Good: Single query SELECT u.* FROM users u JOIN ( SELECT user_id, COUNT() as post_count FROM posts GROUP BY user_id HAVING COUNT() > 10 ) p ON u.id = p.user_id;
Caching Strategies
Materialized views:
CREATE MATERIALIZED VIEW user_post_counts AS SELECT user_id, COUNT(*) as post_count FROM posts GROUP BY user_id;
-- Refresh periodically REFRESH MATERIALIZED VIEW user_post_counts;
Query result caching (application level):
Cache expensive queries
@cache(ttl=300) def get_popular_posts(): return db.query("SELECT * FROM posts ORDER BY views DESC LIMIT 10")
Common Patterns
Full-text Search
PostgreSQL:
-- Add tsvector column ALTER TABLE posts ADD COLUMN search_vector tsvector;
-- Update with trigger CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);
-- Search SELECT * FROM posts WHERE search_vector @@ to_tsquery('postgresql & optimization');
Use dedicated search engine for complex needs:
-
Elasticsearch
-
Algolia
-
Meilisearch
Batch Operations
Bulk insert:
-- Bad: Multiple inserts INSERT INTO users (name) VALUES ('User 1'); INSERT INTO users (name) VALUES ('User 2');
-- Good: Single insert INSERT INTO users (name) VALUES ('User 1'), ('User 2'), ('User 3');
Bulk update:
-- Use CASE for conditional updates UPDATE posts SET status = CASE WHEN views > 1000 THEN 'popular' WHEN views > 100 THEN 'normal' ELSE 'new' END;
Connection Pooling
Use connection pool
from sqlalchemy import create_engine
engine = create_engine( 'postgresql://user:pass@localhost/db', pool_size=20, max_overflow=10 )
Performance Monitoring
Check slow queries:
-- PostgreSQL: Enable slow query log ALTER DATABASE mydb SET log_min_duration_statement = 1000; -- 1 second
-- View pg_stat_statements SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;
Check index usage:
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE idx_scan = 0 -- Unused indexes ORDER BY pg_relation_size(indexrelid) DESC;
Check table statistics:
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch FROM pg_stat_user_tables ORDER BY seq_scan DESC;
Best Practices
Always:
-
Use EXPLAIN ANALYZE for slow queries
-
Index foreign keys
-
Index WHERE/JOIN columns
-
Limit result sets
-
Use prepared statements
Avoid:
-
SELECT *
-
Functions on indexed columns in WHERE
-
Correlated subqueries
-
Large OFFSET values
-
Over-indexing
Monitor:
-
Slow query log
-
Index usage
-
Table statistics
-
Connection pool
Troubleshooting
Query still slow after indexing:
-
Check if index is being used (EXPLAIN)
-
Verify index column order for composite indexes
-
Consider covering index
-
Check for stale statistics (ANALYZE table)
Too many indexes:
-
Remove unused indexes
-
Combine similar indexes
-
Monitor write performance
High memory usage:
-
Reduce work_mem
-
Optimize sort operations
-
Use streaming instead of loading all data