AWS Well-Architected Framework

Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.

When to Use

Use this skill when:

Reviewing existing AWS architecture for best practices
Designing new cloud systems or applications
Troubleshooting operational issues, security vulnerabilities, or reliability problems
Optimizing costs or improving performance
Preparing for architecture reviews or audits
Migrating workloads to AWS
Addressing compliance or sustainability requirements
Answering questions such as "is my architecture good?" or "how can I improve my AWS setup?"

Core Principle

Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.

The Six Pillars

Pillar	Focus	Key Question
Operational Excellence	Run and monitor systems	How do we operate effectively?
Security	Protect information and systems	How do we protect data and resources?
Reliability	Recover from failures	How do we ensure workload availability?
Performance Efficiency	Use resources effectively	How do we meet performance requirements?
Cost Optimization	Avoid unnecessary costs	How do we achieve cost-effective outcomes?
Sustainability	Minimize environmental impact	How do we reduce carbon footprint?

Architecture Review Workflow

CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.

digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}

Red Flags - You're Skipping the Framework:

"This pillar doesn't apply to this workload" - WRONG, every pillar applies
Jumping straight to recommendations without documenting current state
Only reviewing 3-4 pillars instead of all 6
Providing generic advice instead of workload-specific assessment
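
The "All pillars reviewed?" check in the workflow can be made mechanical. The following is a minimal sketch, assuming a hypothetical `PillarReview` record whose field names are illustrative only; it treats a pillar as reviewed only when its current state has been documented.

```typescript
// The six pillar names as listed in this document.
const PILLARS = [
  "Operational Excellence",
  "Security",
  "Reliability",
  "Performance Efficiency",
  "Cost Optimization",
  "Sustainability",
] as const;

type Pillar = (typeof PILLARS)[number];

// Hypothetical record of one pillar's review; field names are illustrative.
interface PillarReview {
  pillar: Pillar;
  currentState: string; // what exists today
  gaps: string[];
  recommendations: string[];
}

// Returns the pillars with no review yet, or with an undocumented current state.
function missingPillars(reviews: PillarReview[]): Pillar[] {
  return PILLARS.filter(
    (p) => !reviews.some((r) => r.pillar === p && r.currentState.trim().length > 0),
  );
}

// Mirrors the loop in the diagram: complete only when nothing is missing.
function isReviewComplete(reviews: PillarReview[]): boolean {
  return missingPillars(reviews).length === 0;
}
```

A review that lists recommendations for a pillar but leaves `currentState` empty still counts as incomplete, which matches the second red flag above.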

Pillar 1: Operational Excellence

Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.

Design Principles

Perform operations as code (IaC)
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure
Learn from operational events and failures

Key Areas

Organization:

How do teams share architecture knowledge?
Are there clear ownership and accountability models?

Prepare:

How do you design workloads for observability?
Infrastructure as code implementation?
Deployment practices (CI/CD)?

Operate:

What's the runbook for common operations?
How do you understand workload health?
How do you respond to events?

Evolve:

How do you learn from operational events?
Process for continuous improvement?

Common Issues & Solutions

Issue	Solution
Manual deployments	Implement CI/CD with CloudFormation/CDK/Terraform
No visibility into system health	Add CloudWatch dashboards, metrics, alarms
Operational procedures outdated	Regular runbook reviews, post-incident learning
Slow incident response	Create automated remediation with Lambda/Systems Manager
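
Alarms are only useful if they do not fire on every isolated spike. CloudWatch alarms can be configured to trigger when M out of the last N datapoints breach a threshold. The sketch below illustrates that rule in plain TypeScript; it is not the service's implementation, it ignores missing-data handling, and the function and parameter names are assumptions made for this example.

```typescript
// Returns true when at least `datapointsToAlarm` of the most recent
// `evaluationPeriods` datapoints are above `threshold`.
function shouldAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
  datapointsToAlarm: number,
): boolean {
  if (evaluationPeriods < 1 || datapointsToAlarm < 1 || datapointsToAlarm > evaluationPeriods) {
    throw new RangeError("expected 1 <= datapointsToAlarm <= evaluationPeriods");
  }
  // Only the most recent window is considered.
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching = recent.filter((value) => value > threshold).length;
  return breaching >= datapointsToAlarm;
}
```

With a CPU threshold of 90 and a "3 out of 5" rule, the series `[40, 95, 50, 92, 97]` would alarm, while a single spike such as `[10, 10, 99]` under a "2 out of 3" rule would not.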

Quick Implementation Checklist

Infrastructure defined as code (CloudFormation/CDK/Terraform)
CI/CD pipeline implemented
CloudWatch dashboards for key metrics
Alarms for critical thresholds
Runbooks documented and accessible
Regular game days to test procedures
Post-incident review process

Pillar 2: Security

Goal: Protect data, systems, and assets through cloud security practices.

Design Principles

Implement strong identity foundation
Enable traceability
Apply security at all layers
Automate security best practices
Protect data in transit and at rest
Keep people away from data
Prepare for security events

Key Areas

Security Foundations:

How do you manage credentials and authentication?
IAM roles and policies following least privilege?

Identity and Access Management:

How do you manage identities for people and machines?
MFA enabled for all human access?

Detection:

How do you detect and investigate security events?
CloudTrail, GuardDuty, Security Hub configured?

Infrastructure Protection:

How do you protect networks and compute?
VPC configuration, security groups, NACLs?

Data Protection:

How do you classify and protect data?
Encryption at rest and in transit?

Incident Response:

How do you respond to security incidents?
Incident response plan tested?

Critical Security Patterns

Never Do:

// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});

Always Do:

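// ✅ CORRECT: Use IAM roles
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials are resolved from the attached IAM role

// CDK (infrastructure code, separate from the runtime code above):
// Lambda function with an explicit IAM role
const lambda = require('aws-cdk-lib/aws-lambda');

// Named `fn` so it does not shadow the `lambda` module
const fn = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege
  role: myRole,
  // ...
});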
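
Hardcoded keys like the one in the "Never Do" example can often be caught before they are committed. The sketch below is a heuristic only: it relies on the fact that long-term AWS access key IDs begin with `AKIA` and temporary ones with `ASIA`, followed by 16 uppercase letters or digits. It does not detect secret access keys, and `findHardcodedKeys` is a hypothetical helper, not part of any AWS tool.

```typescript
// Heuristic pattern for AWS access key IDs ("AKIA"/"ASIA" + 16 characters).
const ACCESS_KEY_ID = /\b(?:AKIA|ASIA)[0-9A-Z]{16}\b/g;

interface KeyFinding {
  line: number; // 1-based line number in the scanned text
  match: string; // the suspected access key ID
}

// Scans source text and reports every line that appears to embed an access key ID.
function findHardcodedKeys(source: string): KeyFinding[] {
  const findings: KeyFinding[] = [];
  source.split("\n").forEach((text, index) => {
    const matches = text.match(ACCESS_KEY_ID) || [];
    for (const match of matches) {
      findings.push({ line: index + 1, match });
    }
  });
  return findings;
}
```

A check like this could run in a pre-commit hook or CI step; purpose-built secret scanners cover far more credential formats and should be preferred where available.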

Security Checklist

Pillar 3: Reliability

Goal: Ensure the workload performs its intended function correctly and consistently.

Design Principles

Automatically recover from failure
Test recovery procedures
Scale horizontally
Stop guessing capacity
Manage change through automation

Key Areas

Foundations:

How do you manage service quotas and constraints?
Network topology designed for HA?

Workload Architecture:

How do you design workload service architecture?
Microservices vs monolith considerations?

Change Management:

How do you monitor workload resources?
How are changes deployed safely?

Failure Management:

How do you back up data?
How do you design for resilience?
DR plan and RTO/RPO defined?

High Availability Patterns

Multi-AZ Deployment:

Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)

Multi-Region Deployment:

Primary Region          Secondary Region
├── Active workload    ├── Standby/Active
├── Database (primary) ├── Database (replica)
└── Route 53 health check monitoring

Backup Strategy

Data Type	Solution	RPO	RTO
RDS	Automated backups + snapshots	< 5 min	< 30 min
DynamoDB	Point-in-time recovery	Seconds	Minutes
S3	Versioning + cross-region replication	Minutes (replication is asynchronous)	Immediate
EBS	Snapshots via AWS Backup	Hours	Hours
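
The RPO and RTO values in the table are targets, so a review should compare them against what the backup setup can actually deliver. A minimal sketch, assuming a hypothetical `RecoveryPlan` shape: worst-case data loss is approximated by the interval between recovery points, and recovery time by the duration measured in the most recent restore test.

```typescript
// Hypothetical description of one data store's objectives and backup setup.
interface RecoveryPlan {
  name: string;
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable downtime
  backupIntervalMinutes: number; // time between recovery points
  measuredRestoreMinutes: number; // duration observed in the last restore test
}

// Lists the objectives a plan fails to meet; an empty list means both are met.
function recoveryGaps(plan: RecoveryPlan): string[] {
  const gaps: string[] = [];
  if (plan.backupIntervalMinutes > plan.rpoMinutes) {
    gaps.push(
      `${plan.name}: backup interval of ${plan.backupIntervalMinutes} min exceeds RPO of ${plan.rpoMinutes} min`,
    );
  }
  if (plan.measuredRestoreMinutes > plan.rtoMinutes) {
    gaps.push(
      `${plan.name}: measured restore of ${plan.measuredRestoreMinutes} min exceeds RTO of ${plan.rtoMinutes} min`,
    );
  }
  return gaps;
}
```

For example, daily EBS snapshots (an interval of 1440 minutes) cannot satisfy a 60-minute RPO, no matter how fast the restore is.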

Reliability Checklist

Pillar 4: Performance Efficiency

Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.

Design Principles

Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Consider mechanical sympathy

Key Areas

Selection:

How do you select appropriate resource types and sizes?
Compute: EC2, Lambda, Fargate, ECS, EKS?
Database: RDS, DynamoDB, Aurora, ElastiCache?
Storage: S3, EFS, EBS, Glacier?

Review:

How do you evolve workload to use new resources?
Regular review of AWS new features?

Monitoring:

How do you monitor resources?
CloudWatch, X-Ray for distributed tracing?

Trade-offs:

How do you use trade-offs to improve performance?
Caching, consistency models, compression?

Performance Patterns

Caching Strategy:

Client → CloudFront (edge cache)
  → API Gateway
    → Lambda
      → ElastiCache (data cache)
        → DynamoDB/RDS
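
The data cache layer in this chain is commonly used in a cache-aside fashion: read from the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below shows that logic with an in-memory `Map` standing in for ElastiCache and a synchronous `load` callback standing in for the database query. Both are simplifying assumptions; a real loader would be asynchronous.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

// Minimal in-memory cache-aside wrapper.
class CacheAside<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly load: (key: string) => V;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `load` reads from the slower backing store; `now` is injectable for testing.
  constructor(load: (key: string) => V, ttlMs: number, now: () => number = () => Date.now()) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // cache hit: the backing store is not touched
    }
    const value = this.load(key); // cache miss or expired entry
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after writing to the backing store so readers do not see stale data.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

The TTL is the trade-off knob: a longer TTL reduces database load but widens the window in which stale data can be served.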

Database Selection:

Use Case	Recommended Service
Relational, complex queries	RDS (PostgreSQL/MySQL)
High throughput, simple queries	DynamoDB
Graph relationships	Neptune
Search and analytics	OpenSearch
Time-series data	Timestream
In-memory cache	ElastiCache (Redis/Memcached)

Performance Checklist

Right-sized compute instances (not over-provisioned)
Content delivery through CloudFront
Database read replicas for read-heavy workloads
Caching layer (ElastiCache, DAX, CloudFront)
Asynchronous processing with SQS/SNS/EventBridge
Auto Scaling configured appropriately
Database indexes optimized
Monitoring with CloudWatch and X-Ray
Regular performance testing under load

Pillar 5: Cost Optimization

Goal: Run systems to deliver business value at the lowest price point.

Design Principles

Implement cloud financial management
Adopt consumption model
Measure overall efficiency
Stop spending on undifferentiated heavy lifting
Analyze and attribute expenditure

Key Areas

Practice Cloud Financial Management:

Cost allocation tags implemented?
Budgets and alerts configured?

Expenditure and Usage Awareness:

How do you govern usage?
Cost Explorer and AWS Budgets configured?

Cost-Effective Resources:

How do you evaluate cost when selecting services?
Reserved Instances or Savings Plans for predictable workloads?

Manage Demand:

How do you manage demand and supply resources?
Throttling, caching to reduce demand?

Optimize Over Time:

How do you evaluate new services?
Regular review of cost optimization opportunities?

Cost Optimization Strategies

Strategy	Implementation	Potential Savings
Right-sizing	Use Compute Optimizer recommendations	20-40%
Reserved Instances	1-year or 3-year commitments	30-75%
Savings Plans	Flexible compute commitments	30-70%
Spot Instances	Fault-tolerant workloads	50-90%
S3 Intelligent-Tiering	Automatic storage class optimization	40-60%
Auto Scaling	Scale resources with demand	30-50%
Lambda instead of EC2	For appropriate workloads	Varies
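
The percentages above only become actionable once applied to the share of spend a strategy really covers. The following sketch shows that arithmetic; the input shape and every figure passed to it are hypothetical, and none of them are AWS prices.

```typescript
// Hypothetical inputs for one optimization strategy.
interface SavingsInput {
  monthlySpendUsd: number; // current spend on the affected resources
  coverage: number; // fraction of that spend the strategy applies to (0 to 1)
  discount: number; // expected discount on the covered spend (0 to 1)
}

// Estimated monthly savings = spend x coverage x discount.
function estimateMonthlySavings(input: SavingsInput): number {
  const { monthlySpendUsd, coverage, discount } = input;
  if (coverage < 0 || coverage > 1 || discount < 0 || discount > 1) {
    throw new RangeError("coverage and discount must be between 0 and 1");
  }
  return monthlySpendUsd * coverage * discount;
}

// Evaluates both ends of a percentage range taken from the table.
function savingsRange(
  monthlySpendUsd: number,
  coverage: number,
  lowDiscount: number,
  highDiscount: number,
): [number, number] {
  return [
    estimateMonthlySavings({ monthlySpendUsd, coverage, discount: lowDiscount }),
    estimateMonthlySavings({ monthlySpendUsd, coverage, discount: highDiscount }),
  ];
}
```

As an illustration, with 2,000 USD of monthly compute spend, 60% of it steady enough to commit, and a 30% to 70% discount range, `savingsRange(2000, 0.6, 0.3, 0.7)` gives roughly 360 to 840 USD per month. The uncommitted 40% is unaffected.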

Cost Monitoring

// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});

Cost Optimization Checklist

Pillar 6: Sustainability

Goal: Minimize environmental impact of running cloud workloads.

Design Principles

Understand your impact
Establish sustainability goals
Maximize utilization
Anticipate and adopt new, more efficient offerings
Use managed services
Reduce downstream impact

Key Areas

Region Selection:

Choose regions with renewable energy
AWS regions with lower carbon intensity

User Behavior Patterns:

Scale resources with demand
Remove unused resources

Software and Architecture:

Optimize code for efficiency
Use appropriate services (serverless over provisioned)

Data Patterns:

Minimize data movement
Use data compression
Implement lifecycle policies

Hardware Patterns:

Use minimum necessary hardware
Use instance types with best performance per watt

Development Process:

Test sustainability improvements
Measure and report carbon footprint

Sustainability Checklist

Workloads in regions with renewable energy
Auto Scaling to match demand (no idle resources)
Unused resources regularly cleaned up
Graviton processors considered for better efficiency
Managed services used where appropriate
Data lifecycle policies to reduce storage
Efficient code (async processing, optimized queries)
Monitoring resource utilization
Carbon footprint tracked (AWS Customer Carbon Footprint Tool)

Review Process

1. Scoping Phase

Questions to ask:

What is the workload scope? (entire system vs specific component)
What are the business objectives?
What are the compliance requirements?
What are the current pain points?

2. Review Each Pillar

For each pillar, use this template:

Current State:

Document what exists today

Gaps:

What's missing or needs improvement?

Risks:

What are the high/medium/low priority risks?

Recommendations:

Specific, actionable improvements

3. Prioritization Matrix

Priority	Criteria
High	Security vulnerabilities, critical availability risks, major cost waste
Medium	Performance issues, moderate cost optimization, operational improvements
Low	Nice-to-haves, future considerations, minor optimizations
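
Once findings carry a risk level, ordering them is straightforward. The sketch below sorts by risk, highest first, and breaks ties by effort, lowest first. The tie-break rule and the `ReviewFinding` shape are assumptions made for illustration; the matrix itself defines only the risk criteria.

```typescript
type Level = "High" | "Medium" | "Low";

// Hypothetical finding shape, loosely following the action plan fields.
interface ReviewFinding {
  pillar: string;
  issue: string;
  risk: Level;
  effort: Level;
}

const RISK_ORDER: Record<Level, number> = { High: 0, Medium: 1, Low: 2 };
const EFFORT_ORDER: Record<Level, number> = { Low: 0, Medium: 1, High: 2 };

// Returns a new array: highest risk first, then lowest effort first.
function prioritize(findings: ReviewFinding[]): ReviewFinding[] {
  return [...findings].sort(
    (a, b) =>
      RISK_ORDER[a.risk] - RISK_ORDER[b.risk] ||
      EFFORT_ORDER[a.effort] - EFFORT_ORDER[b.effort],
  );
}
```

Sorting a copy leaves the original list in review order, which is useful when the per-pillar findings are reported separately from the prioritized plan.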

4. Action Plan Template

## Pillar: [Name]

### Issue: [Description]
- **Risk Level:** High/Medium/Low
- **Impact:** [Business impact]
- **Effort:** Low/Medium/High

### Recommendation:
[Specific actions]

### Implementation Steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]

### Success Criteria:
- [Measurable outcome 1]
- [Measurable outcome 2]

### Resources:
- [AWS documentation links]
- [Blog posts or examples]
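
When a review produces many findings, filling in the template by hand gets repetitive. The sketch below renders one entry in the same layout; the `ActionItem` type and `renderActionItem` function are hypothetical names introduced for this example.

```typescript
// Hypothetical structure mirroring the fields of the template above.
interface ActionItem {
  pillar: string;
  issue: string;
  riskLevel: "High" | "Medium" | "Low";
  impact: string;
  effort: "Low" | "Medium" | "High";
  recommendation: string;
  steps: string[];
  successCriteria: string[];
  resources: string[];
}

// Renders one action item as text in the template's layout.
function renderActionItem(item: ActionItem): string {
  const numbered = item.steps.map((step, i) => `${i + 1}. ${step}`);
  const bullets = (values: string[]): string[] => values.map((v) => `- ${v}`);
  return [
    `## Pillar: ${item.pillar}`,
    "",
    `### Issue: ${item.issue}`,
    `- **Risk Level:** ${item.riskLevel}`,
    `- **Impact:** ${item.impact}`,
    `- **Effort:** ${item.effort}`,
    "",
    "### Recommendation:",
    item.recommendation,
    "",
    "### Implementation Steps:",
    ...numbered,
    "",
    "### Success Criteria:",
    ...bullets(item.successCriteria),
    "",
    "### Resources:",
    ...bullets(item.resources),
  ].join("\n");
}
```

Keeping the action plan as structured data also makes it easy to sort or filter items before rendering them.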

Common Anti-Patterns

Anti-Pattern	Issue	Better Approach
Single AZ deployment	No fault tolerance	Multi-AZ architecture
No IaC	Manual config, drift	CloudFormation/CDK/Terraform
Hardcoded secrets	Security vulnerability	Secrets Manager/Parameter Store
No monitoring	Blind operation	CloudWatch dashboards + alarms
No backups	Data loss risk	Automated backup strategy
Over-provisioning	Cost waste	Right-sizing + Auto Scaling
No cost tracking	Budget overruns	Tags + Budgets + Cost Explorer
Monolithic architecture	Hard to scale	Microservices or serverless

Real-World Example

Scenario: Serverless API with authentication

Architecture Review:

Operational Excellence:

✅ Lambda functions deployed via CDK
✅ CloudWatch logs enabled
❌ Missing: Distributed tracing (X-Ray), dashboards

Security:

❌ CRITICAL: Hardcoded API keys in Lambda environment variables
✅ API Gateway with IAM authorization
❌ Missing: Secrets Manager, encryption at rest

Reliability:

✅ DynamoDB table (replicated across multiple AZs by default)
❌ Single region deployment
❌ Missing: Backup strategy, DR plan

Performance:

✅ CloudFront for static assets
❌ No caching for API responses
❌ Lambda cold starts not optimized

Cost:

❌ DynamoDB provisioned capacity, but traffic is spiky
✅ Lambda usage-based pricing
❌ Missing: Budget alerts, cost allocation tags

Sustainability:

✅ Serverless architecture (good utilization)
❌ Unused dev/test resources running 24/7

Priority Actions:

HIGH: Move API keys to Secrets Manager (Security)
HIGH: Implement DynamoDB backups (Reliability)
MEDIUM: Add X-Ray tracing (Operational Excellence)
MEDIUM: Switch DynamoDB to on-demand (Cost)
LOW: Add API Gateway caching (Performance)

Resources

AWS Well-Architected Framework Whitepaper
AWS Well-Architected Tool (Interactive reviews)
Well-Architected Labs
AWS Architecture Center
Sustainability Pillar Whitepaper

Common Mistakes When Using This Framework

Mistake	Why It's Wrong	Correct Approach
"Sustainability doesn't apply to this workload"	Every workload consumes resources and energy	Review all 6 pillars, even if findings are minimal
Skipping current state documentation	Can't measure improvement without baseline	Always document "Current State" before recommendations
Generic recommendations	Not actionable or specific to this workload	Provide specific AWS services, code examples, priorities
No prioritization	Everything seems equally important	Use HIGH/MEDIUM/LOW risk levels, create phased plan
Forgetting about trade-offs	Optimizing one pillar at expense of others	Explicitly call out trade-offs (e.g., multi-region cost vs reliability)

Using This Skill

When conducting architecture reviews:

Start with context - understand business objectives and constraints
Review systematically - go through all 6 pillars, don't skip ANY
Document findings - use consistent format per pillar (Current State → Gaps → Recommendations)
Prioritize ruthlessly - security and availability issues first
Be specific - actionable recommendations with examples and AWS service names
Provide resources - link to AWS docs and examples
Create action plan - clear next steps with success criteria and effort estimates
Call out trade-offs - be explicit about costs and benefits of each recommendation

Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.

No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.