# Database Architect Skill
## Step 1: Understand Data Requirements

Gather requirements:

- **Entities:** What data needs to be stored?
- **Relationships:** How do entities relate (1:1, 1:N, N:M)?
- **Access Patterns:** How will data be queried?
- **Volume:** Expected data size and growth rate
- **Consistency:** ACID requirements vs. eventual consistency
## Step 2: Design Schema

**For Relational Databases:**

- **Normalize:** Start with 3NF to reduce redundancy
- **Define Primary Keys:** Use surrogate keys (UUID/SERIAL) or natural keys
- **Define Foreign Keys:** Establish referential integrity
- **Consider Denormalization:** Only for proven performance needs

**For NoSQL Databases:**

- **Model for Queries:** Design documents/collections around access patterns
- **Embed vs. Reference:** Embed for 1:1 and 1:few relationships; reference for 1:many
- **Shard Key Selection:** Choose keys that distribute data evenly
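A quick way to sanity-check a candidate shard key is to hash a sample of its values and inspect the resulting distribution. The following is an illustrative sketch only (Python, using MD5 as a stand-in for whatever shard function your database actually applies):

```python
import hashlib
import uuid
from collections import Counter

def shard_for(key: str, shards: int = 4) -> int:
    """Map a key to a shard using a stable hash (MD5 here for
    determinism; real systems use the database's own shard function)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % shards

# High-cardinality keys (e.g., tenant UUIDs) should spread near-evenly;
# a skewed Counter here means the candidate key will create hot shards.
counts = Counter(shard_for(str(uuid.uuid4())) for _ in range(10_000))
```

A low-cardinality key (e.g., a `country` column with a few dominant values) run through the same check would show most rows landing on one or two shards, which is exactly the imbalance to avoid.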
## Step 3: Plan Indexes

Index strategy based on query patterns:

```sql
-- Example: users table with common queries
CREATE INDEX idx_users_email ON users(email);                 -- Exact match
CREATE INDEX idx_users_name ON users(last_name, first_name);  -- Range/sort
CREATE INDEX idx_users_created ON users(created_at DESC);     -- Ordering
```
**Index Guidelines:**

- Index columns used in WHERE, JOIN, and ORDER BY clauses
- Consider composite indexes for multi-column queries
- Avoid over-indexing (every index slows writes)
- Use covering indexes for read-heavy queries
## Step 4: Plan Migrations

Create versioned migrations:

```
migrations/
  001_create_users.sql
  002_add_email_index.sql
  003_create_orders.sql
```
**Migration Best Practices:**

- Always include both up and down migrations
- Test migrations on production-like data
- Plan for zero-downtime migrations
- Back up before running migrations
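The numbered-file convention above can be applied mechanically by any migration tool. As a minimal sketch of the idea, not a production implementation, here is a runner that applies pending `*.sql` files in version order and records them in a hypothetical `schema_migrations` tracking table (SQLite stands in for PostgreSQL so the example is self-contained):

```python
import sqlite3
from pathlib import Path

def apply_migrations(conn, migrations_dir):
    """Apply *.sql files in lexicographic (version) order, skipping any
    already recorded in the schema_migrations tracking table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        if path.stem in applied:
            continue  # already applied; shipped migrations must stay immutable
        conn.executescript(path.read_text())
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (path.stem,)
        )
    conn.commit()
```

Because applied versions are tracked, rerunning the function is idempotent; a real tool (Flyway, sqitch, Alembic, etc.) adds the down/rollback half and checksum verification on top of this loop.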
## Step 5: Optimize Queries

Analyze and improve slow queries:

- **Use EXPLAIN ANALYZE:** Understand execution plans
- **Identify Table Scans:** Replace sequential scans with index scans
- **Optimize JOINs:** Ensure indexes exist on join columns
- **Batch Operations:** Use bulk inserts/updates
- **Connection Pooling:** Reduce connection overhead
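To make the batching point concrete, here is a small sketch using Python's built-in SQLite driver as a stand-in; with PostgreSQL the same pattern applies via `psycopg`'s `executemany` or, better still for large loads, `COPY`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

rows = [(i, f"user{i}@example.com") for i in range(1000)]

# One batched call inside one transaction, instead of 1000 individual
# INSERT statements each paying parse and round-trip overhead.
conn.executemany("INSERT INTO users (id, email) VALUES (?, ?)", rows)
conn.commit()
```

The win comes from amortizing per-statement overhead and committing once; looping over single-row inserts with autocommit is routinely orders of magnitude slower on real databases.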
## Step 6: PostgreSQL 17 Features (2024–2026)

Leverage PostgreSQL 17 capabilities where applicable:

**Performance improvements:**

- New VACUUM memory management: up to 20x lower memory footprint, and vacuum runs faster on busy systems
- Streaming I/O interface accelerates sequential scans on large datasets
- BRIN indexes support parallel builds
- B-tree indexes handle IN-clause queries more efficiently
- Optimized planning of Common Table Expressions (CTEs)
**SQL/JSON enhancements (PG 17):**

- `JSON_TABLE()` converts JSON data into a relational table representation
- JSON constructors and identity functions: `JSON()`, `JSON_SCALAR()`, `JSON_ARRAY()`, `JSON_OBJECT()`
- Use jsonpath expressions for expressive path-based queries over JSONB columns
**Incremental backups:**

- `pg_basebackup` supports incremental backups; use `pg_combinebackup` to reconstruct a full backup from a chain of incrementals
**Logical replication improvements:**

- Failover control for logical replication slots
- `pg_createsubscriber` creates logical replicas from physical standbys
- `pg_upgrade` now preserves logical replication slots across major version upgrades
**Security:**

- New MAINTAIN privilege grants targeted maintenance rights (VACUUM, ANALYZE, etc.) without full superuser access
- `sslnegotiation=direct` client option performs a direct TLS handshake, avoiding a negotiation round trip
**COPY improvements:**

- `COPY ... ON_ERROR ignore` continues an import past row-level errors instead of aborting
## Step 7: pgvector for AI Embeddings

Store and query vector embeddings alongside relational data to avoid running a separate vector database:

```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column (1536 dims for OpenAI text-embedding-3-small)
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor (ANN) search
-- lists = sqrt(row_count) is a good starting value
CREATE INDEX idx_documents_embedding ON documents
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW index (faster queries, more memory; preferred for < 1M vectors)
CREATE INDEX idx_documents_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Similarity search (cosine distance)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
```
**When to use pgvector vs. a dedicated vector DB:**

- Up to ~10M vectors: pgvector is sufficient (sub-50ms queries with an HNSW index)
- Above ~10M vectors, or when specialized ANN algorithms are required: consider Pinecone, Weaviate, or Qdrant
- pgvector advantage: the same backups, replication, and connection pooling as the rest of PostgreSQL
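For intuition about what the similarity query computes: pgvector's `<=>` operator returns cosine distance, and `1 - distance` is the similarity the query reports. A plain-Python sketch of that math (no pgvector required):

```python
import math

def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> operator defines it:
    1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cosine_similarity(a, b):
    # Matches the "1 - (embedding <=> $1::vector)" expression in the query.
    return 1.0 - cosine_distance(a, b)
```

Identical directions give similarity 1.0 and orthogonal vectors give 0.0, which is why `ORDER BY embedding <=> $1::vector` (ascending distance) returns the most similar documents first.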
## Step 8: Table Partitioning Strategies

Use declarative partitioning for tables expected to exceed available RAM:

```sql
-- Range partitioning by date (common for time-series / logs)
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Hash partitioning for even distribution (e.g., multi-tenant)
CREATE TABLE orders (
    id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    total DECIMAL(12,2)
) PARTITION BY HASH (tenant_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```
**Partition pruning:** PostgreSQL automatically skips irrelevant partitions when the partition key appears in the WHERE clause, so always include the partition key in queries.

**Indexes on partitioned tables:** Indexes created on the parent table are automatically created on all child partitions.
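Creating next month's partition ahead of time is usually automated (via cron, or an extension such as pg_partman). As a sketch of just the DDL-generation step, assuming the range-partitioned `events` table shown earlier:

```python
from datetime import date

def monthly_partition_ddl(table: str, start: date) -> str:
    """Generate CREATE TABLE ... PARTITION OF DDL for the month containing `start`."""
    first = start.replace(day=1)
    # First day of the following month (range upper bounds are exclusive).
    if first.month == 12:
        nxt = first.replace(year=first.year + 1, month=1)
    else:
        nxt = first.replace(month=first.month + 1)
    name = f"{table}_{first:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {table} "
        f"FOR VALUES FROM ('{first}') TO ('{nxt}');"
    )
```

A scheduled job would run this for the upcoming month and execute the statement, so inserts never land on a missing partition; in production, prefer pg_partman over hand-rolled scripts.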
## Step 9: JSONB Patterns at Scale

```sql
-- Generated columns promote hot JSONB fields to indexed native columns
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    data JSONB NOT NULL,
    -- Promote frequently filtered fields to B-tree indexed generated columns
    country TEXT GENERATED ALWAYS AS (data->>'country') STORED,
    signup_date DATE GENERATED ALWAYS AS ((data->>'signup_date')::DATE) STORED
);
CREATE INDEX idx_customers_country ON customers (country);
CREATE INDEX idx_customers_signup ON customers (signup_date);

-- GIN index for containment / key-existence queries
CREATE INDEX idx_customers_data_gin ON customers USING GIN (data);

-- Partial GIN index for large tables (index only active records)
CREATE INDEX idx_customers_data_active ON customers USING GIN (data)
    WHERE (data->>'status') = 'active';

-- jsonpath query example (PG 17)
SELECT * FROM customers WHERE data @? '$.tags[*] ? (@ == "premium")';
```
## Step 10: Connection Pooling

Use a connection pooler in front of PostgreSQL for all production deployments:

**PgBouncer (lightweight, battle-tested):**

```ini
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction   ; transaction pooling for stateless apps
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
```
**Supavisor (cloud-native, multi-tenant, Elixir-based):**

- Designed for serverless / edge functions with thousands of short-lived connections
- Supports both session and transaction pooling modes
- Used by default in Supabase deployments; also available self-hosted
**Pooling modes:**

| Mode | Use Case | Notes |
|------|----------|-------|
| Session | Long-running connections, LISTEN/NOTIFY | 1 client = 1 server connection |
| Transaction | Stateless APIs (recommended default) | Most efficient; breaks SET and prepared statements |
| Statement | Rarely needed | Each statement can use a different server connection |
</execution_process>
<best_practices>
- **Normalize First:** Optimize later based on data
- **Index Thoughtfully:** Base indexes on actual query patterns
- **Use Migrations:** Never modify schema directly
- **Monitor Performance:** Use database profiling tools
- **Plan for Scale:** Consider partitioning for large tables
- **Upgrade to PostgreSQL 17:** Benefit from the new VACUUM memory management, SQL/JSON functions, and incremental backup support
- **Use pgvector for AI:** Store embeddings in PostgreSQL with HNSW or IVFFlat indexes before reaching for a dedicated vector database
- **JSONB at Scale:** Promote hot JSONB fields to generated columns with B-tree indexes; use GIN for containment queries
- **Always Pool Connections:** Use PgBouncer (transaction mode) or Supavisor for all production PostgreSQL deployments
- **Partition Large Tables:** Apply range (time-series), list (category), or hash (multi-tenant) partitioning for tables projected to exceed server RAM
</best_practices>
Design a schema for an e-commerce orders system
**Example Response:**

```sql
-- Users table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Orders table
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Order items (1:N relationship)
CREATE TABLE order_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
    product_id UUID NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2)
);

-- Indexes for common queries
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
```
</usage_example>
## Rules

- Always justify denormalization with performance data
- Include a rollback strategy for all migrations
- Document relationships and constraints
## Related Workflow

This skill has a corresponding workflow for complex multi-agent scenarios:

- **Workflow:** `.claude/workflows/database-architect-skill-workflow.md`
- **When to use the workflow:** For comprehensive database design including requirements analysis, schema design, query optimization, migration planning, and testing (multi-phase, multi-agent)
- **When to use the skill directly:** For quick schema reviews or single-agent database tasks
## Iron Laws

- NEVER make schema changes without versioned migrations that include both UP and DOWN scripts; manual DDL in production is not recoverable.
- ALWAYS normalize to at least 3NF before considering denormalization; never optimize prematurely without measured performance evidence.
- ALWAYS plan indexes based on actual query patterns from EXPLAIN ANALYZE; never add indexes speculatively before profiling real workloads.
- NEVER deploy a migration without first running it against production-like data; schema issues surface under realistic volume, not on empty tables.
- ALWAYS use connection pooling (Supavisor or PgBouncer) in production; direct connections from serverless functions exhaust the database connection limit under load.
## Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Manual DDL directly on production | No rollback path; breaks migration history | Always use versioned migrations with DOWN scripts |
| Premature denormalization | Adds complexity before profiling; often no measurable gain | Normalize first; denormalize only after EXPLAIN ANALYZE reveals a bottleneck |
| Indexing every column | Slows writes; wastes storage; misleads the query planner | Index only columns that appear in WHERE, JOIN, and ORDER BY clauses |
| Adding a NOT NULL column without a default | Locks the entire table during migration on large datasets | Add a nullable column, backfill in batches, then add the NOT NULL constraint |
| Direct connections from serverless functions | Connection limit exhausted under load spikes | Use PgBouncer or Supavisor in transaction-pooling mode |
## Memory Protocol (MANDATORY)

Before starting:

```sh
cat .claude/context/memory/learnings.md
```

After completing:

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.