System Architecture Expert

When to use this Skill

Use this Skill when:

Designing distributed systems
Writing system design documentation
Preparing for system design interviews
Creating architecture diagrams
Analyzing trade-offs between design choices
Reviewing or improving existing system designs

System Design Framework

Requirements Gathering (5-10 minutes)

Functional Requirements:

What are the core features?
What actions can users perform?
What are the inputs and outputs?

Non-Functional Requirements:

Scale: How many users? How much data?
Performance: Latency requirements? (p50, p95, p99)
Availability: What uptime is needed? (99.9%, 99.99%)
Consistency: Strong or eventual consistency?
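
An uptime target is easier to reason about once it is converted into a downtime budget. The sketch below does that conversion for the two targets listed above; the function name `allowed_downtime_hours` is illustrative, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is usually what drives the cost of redundancy.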

Constraints:

Budget limitations
Technology stack constraints
Team expertise
Timeline

Example Questions:

How many daily active users?
What's the read:write ratio?
What's the average data size?
What's the peak load vs average load?
Do we need real-time updates?
Can we tolerate any data loss?

Capacity Estimation (Back-of-the-envelope)

Calculate:

Traffic:

DAU = 100M users
Each user makes 10 requests/day
QPS = 100M * 10 / 86400 ≈ 11,574 QPS
Peak QPS = 2-3x average ≈ 30,000 QPS

Storage:

100M users * 1KB of new data per user per day = 100GB/day
With 3x replication = 300GB/day
Growth: 300GB/day * 365 days = 109.5TB/year

Bandwidth:

QPS * average request size
11,574 * 10KB = 115.74MB/s

Memory/Cache:

80-20 rule: 20% of data gets 80% of traffic
Cache size ≈ 20% of total data, enough to hold the hot set (e.g. 20GB for a 100GB dataset)
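
The back-of-the-envelope numbers above can be reproduced with a few helper functions. This is a rough sketch: the function names are illustrative, the inputs are the example figures from this section, and decimal units (1MB = 1,000KB, 1GB = 1,000,000KB) are assumed.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, requests_per_user: int) -> float:
    """Average queries per second, spread evenly over a day."""
    return daily_active_users * requests_per_user / SECONDS_PER_DAY


def bandwidth_mb_per_s(qps: float, request_kb: float) -> float:
    """Bandwidth in MB/s, using decimal units (1 MB = 1,000 KB)."""
    return qps * request_kb / 1_000


def replicated_storage_gb(users: int, kb_per_user: float, replication: int) -> float:
    """Storage in GB after replication (1 GB = 1,000,000 KB)."""
    return users * kb_per_user / 1_000_000 * replication


def hot_cache_gb(total_gb: float, hot_fraction: float = 0.2) -> float:
    """Cache size needed to hold the hot fraction of the data (80-20 rule)."""
    return total_gb * hot_fraction


qps = average_qps(100_000_000, 10)
print(f"Average QPS: {qps:,.0f}")
print(f"Peak QPS (2-3x): {2 * qps:,.0f} to {3 * qps:,.0f}")
print(f"Bandwidth: {bandwidth_mb_per_s(qps, 10):.2f} MB/s")
print(f"Replicated storage: {replicated_storage_gb(100_000_000, 1, 3):.0f} GB")
print(f"Hot cache: {hot_cache_gb(100):.0f} GB")
```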

High-Level Design

Core Components:

Client Layer (Web, Mobile, Desktop)
API Gateway / Load Balancer
Application Servers (Business logic)
Cache Layer (Redis, Memcached)
Database (SQL, NoSQL, or both)
Message Queue (Kafka, RabbitMQ)
Object Storage (S3, GCS)
CDN (CloudFront, Akamai)

Draw Architecture:

[Clients] → [CDN]
    ↓
[Load Balancer]
    ↓
[Application Servers]
  ↙     ↓     ↘
[Cache] [DB] [Queue] → [Workers]
        ↓
[Object Storage]

Database Design

SQL vs NoSQL Decision:

Use SQL when:

ACID transactions required
Complex queries with JOINs
Structured data with relationships
Examples: PostgreSQL, MySQL

Use NoSQL when:

Massive scale (horizontal scaling)
Flexible schema
High write throughput
Examples: Cassandra, DynamoDB, MongoDB

Sharding Strategy:

Hash-based: user_id % num_shards
Range-based: Users 1-100M on shard 1
Geographic: US users on US shard
Consistent hashing: For even distribution
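
To make the difference between hash-based sharding and consistent hashing concrete, here is a minimal sketch. `modulo_shard`, `stable_hash`, `ConsistentHashRing`, the shard names, and the choice of 100 virtual nodes per shard are all illustrative assumptions; MD5 is used only to spread keys, not for security.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike Python's built-in hash())."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def modulo_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: simple, but most keys move when num_shards changes."""
    return user_id % num_shards


class ConsistentHashRing:
    """Consistent hashing with virtual nodes to even out the distribution."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node: str, vnodes: int = 100) -> None:
        for i in range(vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect.bisect_left(self._ring, (stable_hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(modulo_shard(42, 4), ring.get_node("user:42"))
```

When a shard is added to the ring, the only keys that move are the ones the new shard takes over, whereas changing `num_shards` in the modulo scheme remaps most keys.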

Schema Design:

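-- Example: URL Shortener (PostgreSQL)
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,  -- UNIQUE already creates an index on short_url
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause, so secondary indexes are created separately.
CREATE INDEX idx_urls_user_id ON urls (user_id);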

Deep Dive Components

Caching Strategy:

Cache-Aside: App reads from cache, loads from DB on miss
Write-Through: Write to cache and DB together
Write-Behind: Write to cache, async write to DB
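
Cache-aside is the most common of these strategies, so a small sketch may help. It uses a plain dict as the cache and a caller-supplied loader in place of a real database; the class name `CacheAside` and its methods are illustrative, not a specific library's API.

```python
class CacheAside:
    """Cache-aside: read from the cache first, load from the DB on a miss."""

    def __init__(self, load_from_db):
        self._cache = {}
        self._load_from_db = load_from_db  # hypothetical DB read function
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.misses += 1
        value = self._load_from_db(key)
        self._cache[key] = value  # populate the cache for later reads
        return value

    def update(self, key, value, write_to_db):
        write_to_db(key, value)      # the database stays the source of truth
        self._cache.pop(key, None)   # invalidate so the next read reloads


database = {"user:1": "Ada"}
cache = CacheAside(database.get)
print(cache.get("user:1"), cache.get("user:1"), "misses:", cache.misses)
```

Invalidating on update rather than writing the new value into the cache is one common choice; it trades an extra miss for a lower risk of caching stale data.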

Eviction Policies:

LRU (Least Recently Used) - Most common
LFU (Least Frequently Used)
TTL (Time To Live)
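
As an illustration of LRU eviction, the sketch below builds a fixed-capacity cache on `collections.OrderedDict`. The class name `LRUCache` is illustrative; production caches such as Redis use their own (often approximate) LRU implementations.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b"
print(cache.get("b"), cache.get("a"), cache.get("c"))
```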

Load Balancing:

Round Robin: Simple, equal distribution
Least Connections: Route to least busy server
Consistent Hashing: Minimize redistribution
Weighted: Based on server capacity
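
Two of these strategies are simple enough to sketch in a few lines. The classes `RoundRobinBalancer` and `LeastConnectionsBalancer` and the server names are illustrative; a real load balancer also handles health checks and concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Round robin: hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Least connections: route to the server with the fewest active requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])
```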

Message Queue Patterns:

Pub/Sub: One-to-many (notifications)
Work Queue: Task distribution (job processing)
Fan-out: Broadcast to multiple queues
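
The difference between pub/sub and a work queue is who receives each message. The in-process sketch below is only meant to show that difference; `PubSub` and `WorkQueue` are illustrative classes, not the API of Kafka or RabbitMQ.

```python
from collections import deque


class PubSub:
    """Pub/sub: every subscriber of a topic receives every message."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message) -> None:
        for handler in self._subscribers.get(topic, []):
            handler(message)


class WorkQueue:
    """Work queue: each task is handed to exactly one worker."""

    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task) -> None:
        self._tasks.append(task)

    def dequeue(self):
        return self._tasks.popleft() if self._tasks else None


email_inbox, push_inbox = [], []
bus = PubSub()
bus.subscribe("notifications", email_inbox.append)
bus.subscribe("notifications", push_inbox.append)
bus.publish("notifications", "order shipped")
print(email_inbox, push_inbox)
```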

Scalability Patterns

Horizontal Scaling:

Add more servers
Use load balancers
Stateless application servers
Session stored in cache/DB

Vertical Scaling:

Add more CPU/RAM to servers
Limited by hardware
Simpler but has limits

Microservices:

Monolith:
[Single App] → [DB]

Microservices:
[User Service] → [User DB]
[Post Service] → [Post DB]
[Feed Service] → [Feed DB]

Benefits:

Independent scaling
Technology flexibility
Fault isolation

Drawbacks:

Increased complexity
Network latency
Distributed transactions

Reliability & Availability

Replication:

Master-Slave: One writer, multiple readers
Master-Master: Multiple writers (conflict resolution needed)
Multi-region: Geographic redundancy

Failover:

Active-Passive: Standby server takes over
Active-Active: Both servers handle traffic

Rate Limiting:

Token bucket algorithm
Leaky bucket algorithm
Fixed window counter
Sliding window log
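
As one example of these algorithms, here is a minimal single-process token bucket. The class name `TokenBucket` and its parameters are illustrative; the optional `now` argument exists only so the behaviour can be demonstrated with fixed timestamps. A distributed limiter would keep this state in a shared store instead.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float, now=None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full
        self._last = time.monotonic() if now is None else now

    def allow(self, now=None, cost: float = 1.0) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self._last)
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_second=1.0, now=0.0)
print([bucket.allow(now=0.0) for _ in range(3)])  # burst of 2, then rejected
print(bucket.allow(now=1.0))                      # one token refilled after 1s
```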

Circuit Breaker:

States:
Closed → Normal operation
Open → Reject requests immediately
Half-Open → Test if service recovered
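
The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration: `CircuitBreaker`, `CircuitOpenError`, the failure threshold, and the recovery timeout are assumptions chosen for the example, and the injectable `clock` is there only to make the behaviour easy to demonstrate.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; request rejected")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._failures = 0
        self.state = "closed"  # success closes the circuit again
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call, or too many failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=10.0)
print(breaker.call(lambda: "ok"), breaker.state)
```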

Common System Design Patterns

Content Delivery:

Use CDN for static assets
Geo-distributed edge servers
Cache at edge locations

Data Consistency:

Strong Consistency: Read reflects latest write (ACID)
Eventual Consistency: Reads eventually reflect write (BASE)
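CAP Theorem: Of Consistency, Availability, and Partition Tolerance, a distributed system can guarantee only two at once. Network partitions cannot be avoided in practice, so the real choice during a partition is between consistency and availability.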

API Design:

RESTful:
GET /api/users/{id}
POST /api/users
PUT /api/users/{id}
DELETE /api/users/{id}

GraphQL:
query {
  user(id: "123") {
    name
    posts {
      title
    }
  }
}

System Design Template

Use this structure (based on system_design/00_template.md):

{System Name}

1. Requirements

Functional

[List core features]

Non-Functional

Scale: [Users, QPS, Data]
Performance: [Latency requirements]
Availability: [Uptime target]

2. Capacity Estimation

Traffic: [QPS calculations]
Storage: [Data size, growth]
Bandwidth: [Network requirements]

3. API Design

[endpoint] - [description]

4. High-Level Architecture

[Diagram]

5. Database Schema

[Tables and relationships]

6. Detailed Design

Component 1

[Deep dive]

Component 2

[Deep dive]

7. Scalability

[How to scale each component]

8. Trade-offs

[Decisions and alternatives]

Real-World Examples

Reference case studies in system_design/:

Netflix: Video streaming, recommendation
Twitter: Timeline, tweet storage, trending
Uber: Real-time matching, location tracking
Instagram: Image storage, feed generation
WhatsApp: Message delivery, presence

Common Patterns:

News Feed: Fan-out on write vs fan-out on read
Rate Limiter: Token bucket with Redis
URL Shortener: Base62 encoding, hash collision handling
Chat System: WebSocket, message queue
Notification: Push notification service, APNs/FCM
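
For the URL shortener pattern, Base62 encoding turns a numeric ID (for example an auto-incrementing primary key) into a short alphanumeric code. The sketch below assumes one particular alphabet order (digits, then lowercase, then uppercase); any fixed order works as long as encoding and decoding agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def base62_encode(number: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Decode a Base62 string back to the original integer."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


short_code = base62_encode(123_456_789)
print(short_code, base62_decode(short_code))
```

Because each ID maps to exactly one code, encoding a unique counter avoids collisions altogether; collision handling is only needed when the code is derived from a hash of the long URL.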

Interview Tips

Time Management:

Requirements: 10%
High-level design: 25%
Deep dive: 50%
Wrap up: 15%

Communication:

Think out loud
Ask clarifying questions
Discuss trade-offs
Acknowledge limitations

What interviewers look for:

Problem-solving approach
Technical depth
Trade-off analysis
Scale awareness
Communication skills

Common Mistakes to Avoid

Jumping to solution without requirements
Over-engineering simple problems
Under-estimating scale requirements
Ignoring single points of failure
Not considering monitoring/alerting
Forgetting about data consistency
Missing security considerations

Project Context

Templates in system_design/00_template.md
Case studies in system_design/*.md
Reference materials in doc/system_design/
Follow the established documentation pattern
