# Database Query Optimizer

Analyzes database queries, interprets EXPLAIN plans, suggests indexes, and detects common performance issues such as N+1 queries.
## When to Use

- "Optimize my database query"
- "Analyze EXPLAIN plan"
- "Why is my query slow?"
- "Suggest indexes"
- "Fix N+1 queries"
- "Improve database performance"
## Instructions

### PostgreSQL Query Analysis
Run EXPLAIN:

```sql
EXPLAIN ANALYZE
SELECT u.name, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
ORDER BY post_count DESC
LIMIT 10;
```
Interpret EXPLAIN output:

```
QUERY PLAN
Limit  (cost=1234.56..1234.58 rows=10 width=40) (actual time=45.123..45.125 rows=10 loops=1)
  ->  Sort  (cost=1234.56..1345.67 rows=44444 width=40) (actual time=45.122..45.123 rows=10 loops=1)
        Sort Key: (count(p.id)) DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=1000.00..1200.00 rows=44444 width=40) (actual time=40.456..42.789 rows=45000 loops=1)
              Group Key: u.id
              ->  Hash Left Join  (cost=100.00..900.00 rows=50000 width=32) (actual time=1.234..35.678 rows=100000 loops=1)
                    Hash Cond: (p.user_id = u.id)
                    ->  Seq Scan on posts p  (cost=0.00..500.00 rows=50000 width=4) (actual time=0.010..10.234 rows=50000 loops=1)
                    ->  Hash  (cost=75.00..75.00 rows=2000 width=32) (actual time=1.200..1.200 rows=2000 loops=1)
                          Buckets: 2048  Batches: 1  Memory Usage: 125kB
                          ->  Seq Scan on users u  (cost=0.00..75.00 rows=2000 width=32) (actual time=0.005..0.678 rows=2000 loops=1)
                                Filter: (created_at > '2024-01-01'::date)
                                Rows Removed by Filter: 500
Planning Time: 0.234 ms
Execution Time: 45.234 ms
```
Key metrics to analyze:

- `cost`: estimated cost (first number = startup, second = total)
- `rows`: estimated number of rows returned
- `width`: average row size in bytes
- `actual time`: real execution time (ms)
- `loops`: number of times the node was executed
Red flags:

- Sequential scans on large tables
- High cost values
- Row estimates far from actual counts (statistics may be stale; run ANALYZE)
- High loop counts (a node re-executed many times)
- Slow execution time
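These red flags can also be checked mechanically. The sketch below is a minimal illustration (not part of the original skill): it runs `EXPLAIN (ANALYZE, FORMAT JSON)` through psycopg2 and walks the plan tree; the DSN and the 10x estimate-drift threshold are placeholder assumptions.

```python
# Minimal sketch: scan a PostgreSQL JSON plan for the red flags above.
import json
import psycopg2

ESTIMATE_RATIO = 10  # flag when estimated vs. actual rows differ by 10x or more

def find_red_flags(node, flags):
    """Recursively walk a plan node from EXPLAIN (ANALYZE, FORMAT JSON)."""
    if node.get("Node Type") == "Seq Scan":
        flags.append(f"Seq Scan on {node.get('Relation Name')}")
    est, actual = node.get("Plan Rows", 0), node.get("Actual Rows", 0)
    if est and actual and max(est, actual) / min(est, actual) >= ESTIMATE_RATIO:
        flags.append(f"{node['Node Type']}: estimated {est} rows, actual {actual}")
    if node.get("Actual Loops", 1) > 1:
        flags.append(f"{node['Node Type']} executed {node['Actual Loops']} times")
    for child in node.get("Plans", []):
        find_red_flags(child, flags)

def explain_red_flags(conn, query):
    with conn.cursor() as cur:
        # Note: ANALYZE actually executes the query.
        cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {query}")
        result = cur.fetchone()[0]
        if isinstance(result, str):  # some drivers return the JSON as text
            result = json.loads(result)
        plan = result[0]["Plan"]
    flags = []
    find_red_flags(plan, flags)
    return flags

# Usage (DSN is a placeholder):
# conn = psycopg2.connect("dbname=mydb user=postgres")
# for flag in explain_red_flags(conn, "SELECT * FROM users"):
#     print(flag)
```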
### Optimization Strategies
Add Index:
-- Create index on filtered column CREATE INDEX idx_users_created_at ON users(created_at);
-- Create index on join column CREATE INDEX idx_posts_user_id ON posts(user_id);
-- Composite index for specific query pattern CREATE INDEX idx_users_created_name ON users(created_at, name);
-- Partial index for common filter CREATE INDEX idx_users_recent ON users(created_at) WHERE created_at > '2024-01-01';
-- Covering index (includes all needed columns) CREATE INDEX idx_users_covering ON users(id, name, created_at);
Rewrite Query:

```sql
-- ❌ BAD: correlated subquery in SELECT (runs once per row)
SELECT u.name,
       (SELECT COUNT(*) FROM posts WHERE user_id = u.id) AS post_count
FROM users u;

-- ✅ GOOD: use a JOIN
SELECT u.name, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
GROUP BY u.id, u.name;

-- ❌ BAD: OR across different columns
SELECT * FROM users WHERE email = 'test@example.com' OR username = 'test';

-- ✅ GOOD: UNION (each branch can use its own index)
SELECT * FROM users WHERE email = 'test@example.com'
UNION
SELECT * FROM users WHERE username = 'test';

-- ❌ BAD: function on an indexed column
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- ✅ GOOD: create a functional index, or avoid the function
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Or simply: SELECT * FROM users WHERE email = 'test@example.com';
```
### N+1 Query Detection
Problem (Python/SQLAlchemy example):

```python
# ❌ N+1 query problem
users = User.query.all()              # 1 query
for user in users:
    posts = user.posts                # N queries (one per user)
    print(f"{user.name}: {len(posts)} posts")
# Total: 1 + N queries
```
Solution:

```python
# ✅ Eager loading
from sqlalchemy.orm import joinedload

users = User.query.options(joinedload(User.posts)).all()  # 1 query
for user in users:
    posts = user.posts                # no additional query
    print(f"{user.name}: {len(posts)} posts")
# Total: 1 query
```
Node.js/Sequelize:

```javascript
// ❌ N+1 problem
const users = await User.findAll();
for (const user of users) {
  const posts = await user.getPosts(); // N queries
}
```

```javascript
// ✅ Solution: include associations
const users = await User.findAll({
  include: [{ model: Post }] // 1 query with a JOIN
});
```
Rails/ActiveRecord:

```ruby
# ❌ N+1 problem
users = User.all
users.each do |user|
  puts user.posts.count # N queries
end

# ✅ Solution: includes
users = User.includes(:posts)
users.each do |user|
  puts user.posts.size # uses the preloaded records; no additional queries
end
```
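N+1 problems are easiest to catch by counting queries. The following is an illustrative SQLAlchemy sketch (not part of the original skill) that counts statements issued inside a block using the engine's `before_cursor_execute` event; a loop that fires one query per row shows up immediately. The `engine`, `session`, and `User` names are assumed to come from your application.

```python
# Minimal sketch: count SQL statements issued within a block (SQLAlchemy).
# A count near 1 + N over a small result set indicates an N+1 pattern.
from contextlib import contextmanager
from sqlalchemy import event

@contextmanager
def count_queries(engine):
    counter = {"n": 0}

    def before_cursor_execute(conn, cursor, statement, parameters,
                              context, executemany):
        counter["n"] += 1

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    try:
        yield counter
    finally:
        event.remove(engine, "before_cursor_execute", before_cursor_execute)

# Usage (engine/session/User come from your app):
# with count_queries(engine) as q:
#     users = session.query(User).all()
#     for user in users:
#         _ = user.posts              # lazy loads fire here
# print(f"{q['n']} queries issued")   # 1 + N suggests an N+1 problem
```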
### Index Suggestions
Automated analysis:

```sql
-- PostgreSQL: index candidates
-- (high-cardinality columns poorly correlated with physical order)
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct > 100
  AND correlation < 0.5
ORDER BY n_distinct DESC;

-- Tables with heavy sequential scans
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND seq_tup_read / seq_scan > 10000
ORDER BY seq_tup_read DESC;

-- Unused indexes
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
```
MySQL:

```sql
-- Unused indexes
SELECT * FROM sys.schema_unused_indexes;

-- Duplicate (redundant) indexes
SELECT * FROM sys.schema_redundant_indexes;

-- Statements doing full table scans (index candidates)
SELECT * FROM sys.statements_with_full_table_scans LIMIT 10;
```
### Query Optimization Checklist
Python Script:

```python
#!/usr/bin/env python3
import re

import psycopg2


class QueryOptimizer:
    def __init__(self, conn):
        self.conn = conn

    def analyze_query(self, query):
        """Analyze a query and provide optimization suggestions."""
        suggestions = []

        # Check for SELECT *
        if re.search(r'SELECT\s+\*', query, re.IGNORECASE):
            suggestions.append("❌ Avoid SELECT *. Specify only needed columns.")

        # Check for missing WHERE clause
        if re.search(r'FROM\s+\w+', query, re.IGNORECASE) and \
           not re.search(r'WHERE', query, re.IGNORECASE):
            suggestions.append("⚠️ No WHERE clause. Consider adding filters.")

        # Check for OR in WHERE
        if re.search(r'WHERE.*\sOR\s', query, re.IGNORECASE):
            suggestions.append("⚠️ OR conditions may prevent index usage. Consider UNION.")

        # Check for functions on indexed columns
        if re.search(r'WHERE\s+\w+\([^\)]+\)\s*=', query, re.IGNORECASE):
            suggestions.append("❌ Functions on columns prevent index usage.")

        # Check for LIKE with leading wildcard
        if re.search(r'LIKE\s+[\'"]%', query, re.IGNORECASE):
            suggestions.append("❌ LIKE with leading % cannot use an index.")

        # Run EXPLAIN ANALYZE (note: this executes the query)
        cursor = self.conn.cursor()
        try:
            cursor.execute(f"EXPLAIN ANALYZE {query}")
            plan = cursor.fetchall()
            plan_str = str(plan)

            # Check for sequential scans
            if 'Seq Scan' in plan_str:
                suggestions.append("❌ Sequential scan detected. Consider adding an index.")

            # Check total cost (the second number in cost=startup..total)
            cost_match = re.search(r'cost=\d+\.\d+\.\.(\d+\.\d+)', plan_str)
            if cost_match:
                cost = float(cost_match.group(1))
                if cost > 10000:
                    suggestions.append(f"⚠️ High query cost: {cost:.2f}")

            return {
                'suggestions': suggestions,
                'explain_plan': plan,
            }
        finally:
            cursor.close()

    def suggest_indexes(self, query):
        """Suggest indexes based on the query pattern (naive heuristic:
        table names are not inferred, so table_name is a placeholder)."""
        indexes = []

        # Find WHERE conditions (columns may be table-qualified)
        where_matches = re.findall(r'WHERE\s+(?:\w+\.)?(\w+)\s*[=<>]', query, re.IGNORECASE)
        for col in where_matches:
            indexes.append(f"CREATE INDEX idx_{col} ON table_name({col});")

        # Find JOIN conditions
        join_matches = re.findall(r'ON\s+\w+\.(\w+)\s*=\s*\w+\.(\w+)', query, re.IGNORECASE)
        for col1, col2 in join_matches:
            indexes.append(f"CREATE INDEX idx_{col1} ON table_name({col1});")
            indexes.append(f"CREATE INDEX idx_{col2} ON table_name({col2});")

        # Find ORDER BY columns
        order_matches = re.findall(r'ORDER\s+BY\s+(?:\w+\.)?(\w+)', query, re.IGNORECASE)
        for col in order_matches:
            indexes.append(f"CREATE INDEX idx_{col} ON table_name({col});")

        return list(set(indexes))


# Usage
conn = psycopg2.connect("dbname=mydb user=postgres")
optimizer = QueryOptimizer(conn)

query = """
    SELECT u.name, u.email, COUNT(p.id)
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    WHERE u.created_at > '2024-01-01'
    GROUP BY u.id
    ORDER BY COUNT(p.id) DESC
    LIMIT 10
"""

result = optimizer.analyze_query(query)
for suggestion in result['suggestions']:
    print(suggestion)

print("\nSuggested indexes:")
for index in optimizer.suggest_indexes(query):
    print(index)
```
### MongoDB Optimization
Analyze Query:

```javascript
db.users.find({
  created_at: { $gt: ISODate("2024-01-01") },
  status: "active"
}).sort({ created_at: -1 }).explain("executionStats")
```
Check for issues:

```javascript
// Check execution stats
const stats = db.users.find({ status: "active" }).explain("executionStats");

// Red flags:
// - totalDocsExamined >> nReturned (scanning many more docs than returned)
// - COLLSCAN stage (no index used)
// - High executionTimeMillis

// Create an index matching the filter and sort
db.users.createIndex({ status: 1, created_at: -1 });

// Compound index for a specific query
db.users.createIndex({ status: 1, created_at: -1, name: 1 });
```
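For scripted checks, something like the following pymongo sketch can flag those red flags automatically. This is an illustrative addition, not part of the original skill; the connection URI, database, collection, filter, and 10x threshold are all placeholder assumptions.

```python
# Minimal sketch: flag MongoDB queries that scan far more documents than
# they return, or that use no index at all (COLLSCAN).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client.mydb

explain = db.command(
    "explain",
    {"find": "users", "filter": {"status": "active"}},
    verbosity="executionStats",
)
stats = explain["executionStats"]

if stats["totalDocsExamined"] > 10 * max(stats["nReturned"], 1):
    print("⚠️ Examining far more docs than returned; check indexes.")
if "COLLSCAN" in str(explain["queryPlanner"]["winningPlan"]):
    print("❌ Collection scan: no index used.")
```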
### ORM Query Optimization
Django:

```python
# ❌ N+1 problem
users = User.objects.all()
for user in users:
    print(user.profile.bio)  # N queries

# ✅ select_related (for ForeignKey/OneToOne)
users = User.objects.select_related('profile').all()

# ✅ prefetch_related (for ManyToMany/reverse ForeignKey)
users = User.objects.prefetch_related('posts').all()

# ❌ Loading all records
users = User.objects.all()  # loads everything into memory

# ✅ Use an iterator for large datasets
for user in User.objects.iterator(chunk_size=1000):
    process(user)

# ❌ Multiple queries
active_users = User.objects.filter(is_active=True).count()
inactive_users = User.objects.filter(is_active=False).count()

# ✅ Single aggregation
from django.db.models import Count, Q

stats = User.objects.aggregate(
    active=Count('id', filter=Q(is_active=True)),
    inactive=Count('id', filter=Q(is_active=False)),
)
```
TypeORM:

```typescript
// ❌ N+1 problem
const users = await userRepository.find();
for (const user of users) {
  const posts = await postRepository.find({ where: { userId: user.id } });
}

// ✅ Use relations
const usersWithRelations = await userRepository.find({
  relations: ['posts', 'profile'],
});

// ✅ Query builder for complex queries
const filtered = await userRepository
  .createQueryBuilder('user')
  .leftJoinAndSelect('user.posts', 'post')
  .where('user.created_at > :date', { date: '2024-01-01' })
  .andWhere('post.status = :status', { status: 'published' })
  .getMany();

// ✅ Use select to limit columns
const slim = await userRepository
  .createQueryBuilder('user')
  .select(['user.id', 'user.name', 'user.email'])
  .getMany();
```
### Performance Monitoring
PostgreSQL:

```sql
-- Top slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are total_exec_time, mean_exec_time, max_exec_time)
SELECT query, calls, total_time, mean_time, max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Largest tables, with external (index + TOAST) size
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)
                      - pg_relation_size(schemaname||'.'||tablename)) AS external_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
```
MySQL:

```sql
-- Slow queries (requires slow_query_log=ON and log_output='TABLE')
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;

-- Table statistics
SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH, INDEX_LENGTH, DATA_FREE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY DATA_LENGTH DESC;
```
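To fold the PostgreSQL query above into regular monitoring, a small script can pull the top offenders periodically. The sketch below is illustrative (the DSN is a placeholder); it reads pg_stat_statements via psycopg2 and accounts for the column rename in PostgreSQL 13:

```python
# Minimal sketch: report the slowest queries from pg_stat_statements.
# Assumes the extension is installed; the DSN is a placeholder.
import psycopg2

def top_slow_queries(dsn, limit=10):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Column names changed in PostgreSQL 13 (mean_time -> mean_exec_time).
        cur.execute("SHOW server_version_num")
        pg13_or_later = int(cur.fetchone()[0]) >= 130000
        mean_col = "mean_exec_time" if pg13_or_later else "mean_time"
        total_col = "total_exec_time" if pg13_or_later else "total_time"
        cur.execute(
            f"SELECT query, calls, {total_col}, {mean_col} "
            f"FROM pg_stat_statements ORDER BY {mean_col} DESC LIMIT %s",
            (limit,),
        )
        return cur.fetchall()

for query, calls, total_ms, mean_ms in top_slow_queries("dbname=mydb user=postgres"):
    print(f"{mean_ms:8.2f} ms avg  {calls:6d} calls  {query[:80]}")
```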
## Best Practices

DO:

- Add indexes on foreign keys
- Use EXPLAIN regularly
- Monitor the slow query log
- Use connection pooling (see the sketch after these lists)
- Implement pagination
- Cache frequent queries
- Use appropriate data types
- Run VACUUM/ANALYZE regularly

DON'T:

- Use SELECT *
- Over-index (it slows writes)
- Use LIKE with a leading %
- Apply functions to indexed columns
- Ignore N+1 queries
- Load entire tables into memory
- Skip query analysis
- Use OR excessively
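As a concrete reference for the pooling and pagination items above, here is a minimal sketch (illustrative, not from the original skill) using psycopg2's built-in connection pool together with keyset pagination; the DSN, table, and column names are placeholders:

```python
# Minimal sketch: connection pooling plus keyset (cursor-based) pagination.
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(minconn=1, maxconn=10,
                            dsn="dbname=mydb user=postgres")

def fetch_users_page(last_id=0, page_size=100):
    """Keyset pagination: WHERE id > last_id avoids OFFSET's linear scan."""
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, name, email FROM users "
                "WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, page_size),
            )
            return cur.fetchall()
    finally:
        pool.putconn(conn)

# Usage: walk the table one page at a time
last_id = 0
while True:
    rows = fetch_users_page(last_id)
    if not rows:
        break
    last_id = rows[-1][0]
```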
## Checklist

- [ ] Slow queries identified
- [ ] EXPLAIN plans analyzed
- [ ] Indexes added where needed
- [ ] N+1 queries fixed
- [ ] Query rewrites implemented
- [ ] Monitoring set up
- [ ] Connection pool configured
- [ ] Caching implemented