AWS Cost Optimization & FinOps
Systematic workflows for AWS cost optimization and financial operations management.
When to Use This Skill
Use this skill when you need to:
- Find cost savings: Identify unused resources, rightsizing opportunities, or commitment discounts
- Analyze spending: Understand cost trends, detect anomalies, or break down costs
- Optimize architecture: Choose cost-effective services, storage tiers, or instance types
- Implement FinOps: Set up governance, tagging, budgets, or monthly reviews
- Make purchase decisions: Evaluate Reserved Instances, Savings Plans, or Spot instances
- Troubleshoot costs: Investigate unexpected bills or cost spikes
- Plan budgets: Forecast costs or evaluate impact of new projects
Cost Optimization Workflow
Follow this systematic approach for AWS cost optimization:
┌─────────────────────────────────────────────┐
│ 1. DISCOVER                                 │
│    What are we spending money on?           │
│    Run: find_unused_resources.py            │
│    Run: cost_anomaly_detector.py            │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│ 2. ANALYZE                                  │
│    Where are the optimization opportunities?│
│    Run: rightsizing_analyzer.py             │
│    Run: detect_old_generations.py           │
│    Run: spot_recommendations.py             │
│    Run: analyze_ri_recommendations.py       │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│ 3. PRIORITIZE                               │
│    What should we optimize first?           │
│    - Quick wins (low risk, high savings)    │
│    - Low-hanging fruit (easy to implement)  │
│    - Strategic improvements                 │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│ 4. IMPLEMENT                                │
│    Execute optimization actions             │
│    - Delete unused resources                │
│    - Rightsize instances                    │
│    - Purchase commitments                   │
│    - Migrate to new generations             │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│ 5. MONITOR                                  │
│    Verify savings and track metrics         │
│    - Monthly cost reviews                   │
│    - Tag compliance monitoring              │
│    - Budget variance tracking               │
└─────────────────────────────────────────────┘
Core Workflows
Workflow 1: Monthly Cost Optimization Review
Frequency: Run monthly (first week of each month)
Step 1: Find Unused Resources
Scan for waste across all resources
python3 scripts/find_unused_resources.py
Expected output:
- Unattached EBS volumes
- Old snapshots
- Unused Elastic IPs
- Idle NAT Gateways
- Idle EC2 instances
- Unused load balancers
- Estimated monthly savings
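To make the first item in that list concrete, here is a minimal sketch of how unattached EBS volumes could be picked out of a `describe_volumes`-shaped response and given a rough monthly cost. It is not the code in `find_unused_resources.py`: the function name is hypothetical, and the per-GB prices are assumed us-east-1 list prices that should be checked against current pricing.

```python
"""Sketch: find unattached EBS volumes in a describe_volumes-style response."""

# Assumed USD per GiB-month prices (hypothetical defaults, verify for your region).
ASSUMED_PRICE_PER_GIB = {"gp2": 0.10, "gp3": 0.08}
FALLBACK_PRICE_PER_GIB = 0.10


def unattached_volumes(volumes):
    """Return (volume_id, size_gib, est_monthly_cost) for unattached volumes.

    `volumes` is the "Volumes" list from boto3's ec2.describe_volumes().
    A volume whose State is "available" is not attached to any instance.
    """
    results = []
    for vol in volumes:
        if vol.get("State") != "available":
            continue  # in-use, creating, deleting, etc.
        price = ASSUMED_PRICE_PER_GIB.get(vol.get("VolumeType"), FALLBACK_PRICE_PER_GIB)
        results.append((vol["VolumeId"], vol["Size"], round(vol["Size"] * price, 2)))
    return results


sample_volumes = [
    {"VolumeId": "vol-1", "State": "in-use", "Size": 50, "VolumeType": "gp3"},
    {"VolumeId": "vol-2", "State": "available", "Size": 100, "VolumeType": "gp3"},
]
for volume_id, size, cost in unattached_volumes(sample_volumes):
    print(f"{volume_id}: {size} GiB, ~${cost}/month")
```

With live data, the input would come from something like `boto3.client("ec2").describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])["Volumes"]`; pagination is left out of the sketch.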
Step 2: Analyze Cost Anomalies
Detect unusual spending patterns
python3 scripts/cost_anomaly_detector.py --days 30
Expected output:
- Cost spikes and anomalies
- Top cost drivers
- Period-over-period comparison
- 30-day forecast
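One simple way to flag cost spikes is a z-score test over daily totals. The sketch below assumes that approach for illustration only; `cost_anomaly_detector.py` may use a different method, and the function name and the default threshold of 2.0 are hypothetical.

```python
"""Sketch: flag daily cost spikes with a z-score threshold."""
from statistics import mean, stdev


def find_cost_spikes(daily_costs, threshold=2.0):
    """Return (day_index, cost, z_score) for days far above the average.

    A day is flagged when its cost exceeds mean + threshold * stdev.
    """
    if len(daily_costs) < 2:
        return []  # stdev needs at least two data points
    mu = mean(daily_costs)
    sigma = stdev(daily_costs)
    if sigma == 0:
        return []  # perfectly flat spend, nothing to flag
    spikes = []
    for index, cost in enumerate(daily_costs):
        z_score = (cost - mu) / sigma
        if z_score > threshold:
            spikes.append((index, cost, round(z_score, 2)))
    return spikes


# Nine ordinary days followed by one day at double the usual spend.
print(find_cost_spikes([100.0] * 9 + [200.0]))
```

A z-score is sensitive to the spike itself inflating the standard deviation, so for short windows a median-based measure may behave better.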
Step 3: Identify Rightsizing Opportunities
Find oversized instances
python3 scripts/rightsizing_analyzer.py --days 30
Expected output:
- EC2 instances with low utilization
- RDS instances with low utilization
- Recommended smaller instance types
- Estimated savings
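The "recommended smaller instance type" step can be illustrated with a small rule: if average and peak CPU are both low, suggest the next size down in the same family. This is a sketch under assumed thresholds (20% average, 40% peak), not the logic of `rightsizing_analyzer.py`; the function name and size table are hypothetical, and not every family offers every size, so a suggestion still has to be checked against what AWS actually sells.

```python
"""Sketch: suggest one size down for instances with low CPU utilization."""

# Common size suffixes in ascending order (assumed subset, not exhaustive).
SIZE_ORDER = ["nano", "micro", "small", "medium", "large", "xlarge", "2xlarge", "4xlarge"]


def suggest_smaller(instance_type, avg_cpu, max_cpu, avg_limit=20.0, max_limit=40.0):
    """Return the next smaller type in the same family, or None.

    Both the average and the peak CPU (percent) must be under their limits.
    """
    if avg_cpu >= avg_limit or max_cpu >= max_limit:
        return None  # utilization is high enough to keep the current size
    family, _, size = instance_type.partition(".")
    if size not in SIZE_ORDER:
        return None  # unknown size such as "metal" or "12xlarge"
    index = SIZE_ORDER.index(size)
    if index == 0:
        return None  # already the smallest size
    return f"{family}.{SIZE_ORDER[index - 1]}"


print(suggest_smaller("m5.2xlarge", avg_cpu=8.0, max_cpu=25.0))
```

CPU alone can mislead; memory, network, and disk metrics should be part of a real rightsizing decision.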
Step 4: Generate Monthly Report
Use the template to compile findings
cp assets/templates/monthly_cost_report.md reports/$(date +%Y-%m)-cost-report.md
Fill in:
- Findings from scripts
- Action items
- Team cost breakdowns
- Optimization wins
Step 5: Team Review Meeting
- Present findings to engineering teams
- Assign optimization tasks
- Track action items to completion
Workflow 2: Commitment Purchase Analysis (RI/Savings Plans)
When: Quarterly or when usage patterns stabilize
Step 1: Analyze Current Usage
Identify workloads suitable for commitments
python3 scripts/analyze_ri_recommendations.py --days 60
Looks for:
- EC2 instances running consistently for 60+ days
- RDS instances with stable usage
- ROI comparison for 1-year vs 3-year commitments
Step 2: Review Recommendations
Evaluate each recommendation:
✅ Good candidate if:
- Running 24/7 for 60+ days
- Workload is stable and predictable
- No plans to change architecture
- Savings > 30%
❌ Poor candidate if:
- Workload is variable or experimental
- Architecture changes planned
- Instance type may change
- Dev/test environment
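The good/poor split above comes down to arithmetic: a commitment is billed for every hour, while On-Demand is billed only for hours actually run. The sketch below works that out for a single instance. The hourly rates in the example are made up, and the function name is hypothetical.

```python
"""Sketch: savings and break-even utilization for a commitment."""


def commitment_savings(on_demand_hourly, committed_hourly, expected_utilization=1.0):
    """Return (savings_fraction, break_even_utilization).

    savings_fraction is relative to paying On-Demand for the hours actually
    used; a negative value means the commitment would cost more.
    break_even_utilization is the share of hours the instance must run
    before the commitment pays for itself.
    """
    if on_demand_hourly <= 0 or committed_hourly < 0:
        raise ValueError("rates must be positive")
    if not 0 < expected_utilization <= 1:
        raise ValueError("expected_utilization must be in (0, 1]")
    on_demand_cost = on_demand_hourly * expected_utilization
    savings = (on_demand_cost - committed_hourly) / on_demand_cost
    break_even = committed_hourly / on_demand_hourly
    return savings, break_even


# Hypothetical rates: $0.10/h On-Demand vs $0.06/h effective committed rate.
print(commitment_savings(0.10, 0.06, expected_utilization=1.0))
print(commitment_savings(0.10, 0.06, expected_utilization=0.5))
```

With these example rates, an instance running 24/7 saves about 40%, while one running half the time loses money, which is why variable and dev/test workloads appear in the poor-candidate list.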
Step 3: Choose Commitment Type
Reserved Instances:
- Standard RI: Highest discount (up to 72% off On-Demand), least flexibility
- Convertible RI: Lower discount (up to 66%), can be exchanged for a different instance type
- Best for: Specific instance types, stable workloads
Savings Plans:
- Compute SP: Flexible across instance families, sizes, and regions (up to 66% savings)
- EC2 Instance SP: Flexible across sizes within one instance family in one region (up to 72% savings)
- Best for: Variable workloads within constraints
Decision Matrix:
Known instance type, won't change → Standard RI
May need to change types → Convertible RI or Compute SP
Variable workloads → Compute Savings Plan
Maximum flexibility → Compute Savings Plan
Step 4: Purchase and Track
- Purchase through AWS Console or CLI
- Tag commitments with purchase date and owner
- Monitor utilization monthly
- Aim for >90% utilization
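The 90% target can be checked mechanically once used and purchased hours are known per commitment. The following is a minimal sketch with a hypothetical function name and input shape; in practice the hours would come from Cost Explorer utilization reports (for example the `get_reservation_utilization` and `get_savings_plans_utilization` calls on the boto3 `ce` client).

```python
"""Sketch: list commitments whose utilization is below a target."""


def underutilized_commitments(commitments, target=0.90):
    """Return (name, utilization) pairs below `target`, lowest first.

    `commitments` maps a label to (used_hours, purchased_hours).
    """
    flagged = []
    for name, (used_hours, purchased_hours) in commitments.items():
        if purchased_hours <= 0:
            continue  # nothing purchased, nothing to measure
        utilization = used_hours / purchased_hours
        if utilization < target:
            flagged.append((name, round(utilization, 3)))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical month with 720 purchased hours per commitment.
example = {"ri-web": (700, 720), "ri-batch": (400, 720)}
print(underutilized_commitments(example))
```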
Reference: See references/best_practices.md for detailed commitment strategies
Workflow 3: Instance Generation Migration
When: During architecture reviews or optimization sprints
Step 1: Detect Old Instances
Find outdated instance generations
python3 scripts/detect_old_generations.py
Identifies:
- t2 → t3 migrations (10% savings)
- m4 → m5 → m6i migrations
- Intel → Graviton opportunities (20% savings)
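The x86-to-x86 moves in that list amount to a lookup from an old family to its successor. Here is a minimal sketch using only the family pairs named above; the function name is hypothetical, it is not taken from `detect_old_generations.py`, and it assumes the same size exists in the newer family, which is not always the case.

```python
"""Sketch: map an older instance family to its suggested successor."""

# Family pairs taken from the list above (t2 → t3, m4 → m5 → m6i).
ASSUMED_UPGRADES = {"t2": "t3", "m4": "m5", "m5": "m6i"}


def suggest_newer_type(instance_type):
    """Return the same size in the next-generation family, or None."""
    family, separator, size = instance_type.partition(".")
    target_family = ASSUMED_UPGRADES.get(family)
    if not separator or target_family is None:
        return None  # malformed name, or no upgrade known for this family
    return f"{target_family}.{size}"


for old_type in ("t2.medium", "m4.large", "m5.large", "c7g.large"):
    print(old_type, "->", suggest_newer_type(old_type))
```

Before acting on a suggestion, confirm that the target type is offered in the region and that the AMI supports it.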
Step 2: Prioritize Migrations
Quick Wins (Low Risk):
t2 → t3: Usually a drop-in replacement (t3 requires ENA and NVMe drivers in the AMI), about 10% savings
m4 → m5: Better performance, about 5% savings
gp2 → gp3: No downtime, up to 20% savings
Medium Effort (Test Required):
x86 → Graviton (ARM64): 20% savings
- Requires ARM64 compatibility testing
- Most modern frameworks support ARM64
- Test in staging first
Step 3: Execute Migration
For EC2 (x86 to x86):
- Stop instance
- Change instance type
- Start instance
- Verify application
For Graviton Migration:
- Create ARM64 AMI or Docker image
- Launch new Graviton instance
- Test thoroughly
- Cut over traffic
- Terminate old instance
Step 4: Validate Savings
- Monitor new costs in Cost Explorer
- Verify performance is acceptable
- Document migration for other teams
Reference: See references/best_practices.md → Compute Optimization
Workflow 4: Spot Instance Evaluation
When: For fault-tolerant workloads or Auto Scaling Groups
Step 1: Identify Candidates
Analyze workloads for Spot suitability
python3 scripts/spot_recommendations.py
Evaluates:
- Instances in Auto Scaling Groups (good candidates)
- Dev/test/staging environments
- Batch processing workloads
- CI/CD and build servers
Step 2: Assess Suitability
Excellent for Spot:
- Stateless applications
- Batch jobs
- CI/CD pipelines
- Data processing
- Auto Scaling Groups
NOT suitable for Spot:
- Databases (without replicas)
- Stateful applications
- Real-time services
- Mission-critical workloads
Step 3: Implementation Strategy
Option 1: Fargate Spot (Easiest)
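# ECS task definition
requiresCompatibilities:
  - FARGATE
# ECS service (the capacity provider strategy is set on the service or cluster, not the task definition)
capacityProviderStrategy:
  - capacityProvider: FARGATE_SPOT
    weight: 70  # 70% Spot
  - capacityProvider: FARGATE
    weight: 30  # 30% On-Demand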
Option 2: EC2 Auto Scaling with Spot
# Mixed instances policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m5n.large
Option 3: EC2 Spot Fleet
Create Spot Fleet with diverse instance types
aws ec2 request-spot-fleet --spot-fleet-request-config file://spot-fleet.json
Step 4: Implement Interruption Handling
Handle 2-minute termination notice
Instance metadata: /latest/meta-data/spot/instance-action
In application:
- Poll for termination notice
- Gracefully shutdown (save state)
- Drain connections
- Exit
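The polling steps above can be sketched with the standard library alone. The metadata path is the one given above; the IMDSv2 token exchange follows the documented header names. The function names, the 5-second interval, and the callback shape are assumptions made for illustration, and the fetcher is injectable so the logic can be exercised off-instance.

```python
"""Sketch: poll instance metadata for a Spot interruption notice (IMDSv2)."""
import json
import time
import urllib.error
import urllib.request

IMDS_BASE = "http://169.254.169.254"
ACTION_PATH = "/latest/meta-data/spot/instance-action"


def imds_get(path, timeout=2):
    """Fetch a metadata path with an IMDSv2 token; return None on HTTP 404."""
    token_request = urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()
    request = urllib.request.Request(
        f"{IMDS_BASE}{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode()
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # no interruption is scheduled
        raise


def interruption_notice(fetch=imds_get):
    """Return the parsed notice, e.g. {"action": ..., "time": ...}, or None."""
    body = fetch(ACTION_PATH)
    return json.loads(body) if body else None


def poll_until_interrupted(on_notice, fetch=imds_get, interval=5.0,
                           max_polls=None, sleep=time.sleep):
    """Poll until a notice appears, call on_notice(notice), and return it.

    Returns None if max_polls is reached without seeing a notice.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = interruption_notice(fetch)
        if notice is not None:
            on_notice(notice)  # save state, drain connections, then exit
            return notice
        polls += 1
        sleep(interval)
    return None
```

On an instance, `poll_until_interrupted(handle_shutdown)` would typically run in a background thread, where `handle_shutdown` is whatever graceful-shutdown routine the application already has.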
Reference: See references/best_practices.md → Compute Optimization → Spot Instances
Quick Reference: Cost Optimization Scripts
All Scripts Location
ls scripts/
find_unused_resources.py
analyze_ri_recommendations.py
detect_old_generations.py
spot_recommendations.py
rightsizing_analyzer.py
cost_anomaly_detector.py
Script Usage Patterns
Monthly Review (Run all):
python3 scripts/find_unused_resources.py
python3 scripts/cost_anomaly_detector.py --days 30
python3 scripts/rightsizing_analyzer.py --days 30
Quarterly Optimization:
python3 scripts/analyze_ri_recommendations.py --days 60
python3 scripts/detect_old_generations.py
python3 scripts/spot_recommendations.py
Specific Region Only:
python3 scripts/find_unused_resources.py --region us-east-1
python3 scripts/rightsizing_analyzer.py --region us-west-2
Named AWS Profile:
python3 scripts/find_unused_resources.py --profile production
python3 scripts/cost_anomaly_detector.py --profile production --days 60
Script Requirements
Install dependencies
pip install boto3 tabulate
AWS credentials required
Configure via: aws configure
Or use: --profile PROFILE_NAME
Service-Specific Optimization
Compute Optimization
Key Actions:
- Migrate to Graviton (about 20% savings)
- Use Spot for fault-tolerant workloads (around 70% savings is typical, up to 90% off On-Demand)
- Purchase RIs for stable workloads (40-65% savings)
- Right-size oversized instances
Reference: references/best_practices.md → Compute Optimization
Storage Optimization
Key Actions:
- Convert gp2 → gp3 (up to 20% savings)
- Implement S3 lifecycle policies (50-95% savings)
- Delete old snapshots
- Use S3 Intelligent-Tiering
Reference: references/best_practices.md → Storage Optimization
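The 20% figure for gp2 → gp3 follows from per-GiB pricing. As a worked example, the sketch below uses assumed us-east-1 list prices ($0.10 and $0.08 per GiB-month); the function name is hypothetical and the prices should be checked for the region in question.

```python
"""Sketch: estimate monthly savings from converting gp2 volumes to gp3."""

# Assumed USD per GiB-month list prices (us-east-1); verify before relying on them.
ASSUMED_GP2_PRICE = 0.10
ASSUMED_GP3_PRICE = 0.08


def gp3_monthly_savings(size_gib, gp2_price=ASSUMED_GP2_PRICE, gp3_price=ASSUMED_GP3_PRICE):
    """Return estimated USD saved per month on storage charges alone."""
    return size_gib * (gp2_price - gp3_price)


# A 500 GiB volume: 500 * (0.10 - 0.08) is about $10 per month, or 20%.
print(round(gp3_monthly_savings(500), 2))
```

The estimate covers storage only. gp3 includes a baseline of 3,000 IOPS and 125 MiB/s, while gp2 scales at 3 IOPS per GiB, so volumes above roughly 1,000 GiB may need extra provisioned IOPS on gp3 to match, which reduces the saving.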
Network Optimization
Key Actions:
- Replace NAT Gateways with VPC Endpoints (save $25-30/month each)
- Use CloudFront to reduce data transfer costs
- Colocate resources in same AZ when possible
Reference: references/best_practices.md → Network Optimization
Database Optimization
Key Actions:
- Right-size RDS instances
- Use gp3 storage (20% cheaper than gp2)
- Evaluate Aurora Serverless for variable workloads
- Purchase RDS Reserved Instances
Reference: references/best_practices.md → Database Optimization
Service Alternatives Decision Guide
Need help choosing between services?
Question: "Should I use EC2, Lambda, or Fargate?"
Answer: See references/service_alternatives.md → Compute Alternatives
Question: "Which S3 storage class should I use?"
Answer: See references/service_alternatives.md → Storage Alternatives
Question: "Should I use RDS or Aurora?"
Answer: See references/service_alternatives.md → Database Alternatives
Question: "NAT Gateway vs VPC Endpoint vs NAT Instance?"
Answer: See references/service_alternatives.md → Networking Alternatives
FinOps Governance & Process
Setting Up FinOps
Phase 1: Foundation (Month 1)
- Enable Cost Explorer
- Set up AWS Budgets
- Define tagging strategy
- Activate cost allocation tags
Phase 2: Visibility (Months 2-3)
- Implement tagging enforcement
- Run optimization scripts
- Set up monthly reviews
- Create team cost reports
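Tagging enforcement usually starts with measuring compliance. The sketch below checks resources against a required tag set and reports a compliance rate. The tag keys are a hypothetical policy, not one defined by this skill, and the input uses the `[{"Key": ..., "Value": ...}]` shape that most AWS APIs return.

```python
"""Sketch: measure tag compliance against a required tag set."""

# Hypothetical tagging policy; replace with the keys your organization requires.
REQUIRED_TAGS = ("Owner", "Team", "Environment", "CostCenter")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required keys that are absent or have an empty value."""
    present = {
        tag["Key"] for tag in resource_tags if (tag.get("Value") or "").strip()
    }
    return [key for key in required if key not in present]


def compliance_rate(resources, required=REQUIRED_TAGS):
    """Return the fraction of resources carrying every required tag.

    `resources` maps a resource ID to its list of tag dicts.
    """
    if not resources:
        return 1.0  # nothing to check
    compliant = sum(
        1 for tags in resources.values() if not missing_tags(tags, required)
    )
    return compliant / len(resources)


example_resources = {
    "i-1": [
        {"Key": "Owner", "Value": "alice"},
        {"Key": "Team", "Value": "platform"},
        {"Key": "Environment", "Value": "prod"},
        {"Key": "CostCenter", "Value": "1234"},
    ],
    "i-2": [{"Key": "Owner", "Value": ""}],
}
print(compliance_rate(example_resources), missing_tags(example_resources["i-2"]))
```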
Phase 3: Culture (Ongoing)
- Cost metrics in engineering KPIs
- Cost review in architecture decisions
- Regular optimization sprints
- FinOps champions in each team
Full Guide: See references/finops_governance.md
Monthly Review Process
Week 1: Data Collection
- Run all optimization scripts
- Export Cost & Usage Reports
- Compile findings
Week 2: Analysis
- Identify trends
- Find opportunities
- Prioritize actions
Week 3: Team Reviews
- Present to engineering teams
- Discuss optimizations
- Assign action items
Week 4: Executive Reporting
- Create executive summary
- Forecast next quarter
- Report optimization wins
Template: See assets/templates/monthly_cost_report.md
Detailed Process: See references/finops_governance.md → Monthly Review Process
Cost Optimization Checklist
Quick Wins (Do First)
- Delete unattached EBS volumes
- Delete old EBS snapshots (>90 days)
- Release unused Elastic IPs
- Convert gp2 → gp3 volumes
- Stop/terminate idle EC2 instances
- Enable S3 Intelligent-Tiering
- Set up AWS Budgets and alerts
Medium Effort (This Quarter)
- Right-size oversized instances
- Migrate to newer instance generations
- Purchase Reserved Instances for stable workloads
- Implement S3 lifecycle policies
- Replace NAT Gateways with VPC Endpoints (where applicable)
- Enable automated resource scheduling (dev/test)
- Implement tagging strategy and enforcement
Strategic Initiatives (Ongoing)
- Migrate to Graviton instances
- Implement Spot for fault-tolerant workloads
- Establish monthly cost review process
- Set up cost allocation by team
- Implement chargeback/showback model
- Create FinOps culture and practices
Troubleshooting Cost Issues
"My bill suddenly increased"
Run cost anomaly detection:
python3 scripts/cost_anomaly_detector.py --days 30
Check Cost Explorer for service breakdown
Review CloudTrail for resource creation events
Check for Auto Scaling events
Verify that no Reserved Instances have expired
"I need to reduce costs by X%"
Follow the optimization workflow:
- Run all discovery scripts
- Calculate total potential savings
- Prioritize by: Savings Amount × (1 / Effort)
- Focus on quick wins first
- Implement strategic changes for long-term
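The prioritization rule, Savings Amount × (1 / Effort), is a simple ratio sort. A minimal sketch follows; the field names and the effort scale (for example, person-days) are assumptions for illustration, as are the figures in the example.

```python
"""Sketch: rank optimization opportunities by savings per unit of effort."""


def prioritize(opportunities):
    """Return opportunities sorted by monthly_savings / effort, highest first.

    Each opportunity is a dict with "name", "monthly_savings", and "effort";
    effort must be positive.
    """
    for item in opportunities:
        if item["effort"] <= 0:
            raise ValueError(f"effort must be positive for {item['name']!r}")
    return sorted(
        opportunities,
        key=lambda item: item["monthly_savings"] / item["effort"],
        reverse=True,
    )


# Hypothetical figures: savings in USD per month, effort in person-days.
backlog = [
    {"name": "Migrate to Graviton", "monthly_savings": 2000, "effort": 8},
    {"name": "Delete unattached EBS", "monthly_savings": 300, "effort": 1},
    {"name": "Convert gp2 to gp3", "monthly_savings": 400, "effort": 2},
]
for item in prioritize(backlog):
    print(item["name"], item["monthly_savings"] / item["effort"])
```

Note how the largest absolute saving does not rank first: the ratio favors quick wins, which matches the advice to start with them.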
"How do I know if Reserved Instances make sense?"
Run RI analysis:
python3 scripts/analyze_ri_recommendations.py --days 60
Look for:
- Instances running 60+ days consistently
- Workloads that won't change
- Savings > 30%
"Which resources can I safely delete?"
Run unused resource finder:
python3 scripts/find_unused_resources.py
Safe to delete (usually):
- Unattached EBS volumes (after verifying)
- Snapshots > 90 days (if backups exist elsewhere)
- Unused Elastic IPs (after verifying not in DNS)
- Stopped EC2 instances > 30 days (after confirming abandoned)
Always verify with resource owner before deletion!
Best Practices Summary
- Tag Everything: Consistent tagging enables cost allocation and accountability
- Monitor Continuously: Weekly script runs catch waste early
- Review Monthly: Regular reviews prevent cost drift
- Right-size Proactively: Don't wait for cost issues to optimize
- Use Commitments Wisely: RIs/SPs for stable workloads only
- Test Before Migrating: Especially for Graviton or Spot
- Automate Cleanup: Scheduled shutdown of dev/test resources
- Share Wins: Celebrate cost savings to build FinOps culture
Additional Resources
Detailed References:
- references/best_practices.md: Comprehensive optimization strategies
- references/service_alternatives.md: Cost-effective service selection
- references/finops_governance.md: Organizational FinOps practices
Templates:
- assets/templates/monthly_cost_report.md: Monthly reporting template
Scripts:
- All scripts in scripts/ directory with --help for usage
AWS Documentation:
- AWS Cost Explorer: https://aws.amazon.com/aws-cost-management/aws-cost-explorer/
- AWS Budgets: https://aws.amazon.com/aws-cost-management/aws-budgets/
- FinOps Foundation: https://www.finops.org