aws-finops

AWS FinOps Skill for Opsy

Step 1: Cost Explorer First

Start with Cost Explorer. It is a global service, so one query covers all regions and services with no per-region calls:

Spend by service — identifies top cost drivers
Spend by region — shows where resources live
Daily trend — spots anomalies

Focus on services representing >5% of spend.
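A minimal Python sketch of the >5% filter. It assumes the spend-by-service data came from a Cost Explorer `get_cost_and_usage` call grouped by the `SERVICE` dimension (for example via boto3 with `Metrics=["UnblendedCost"]` and `GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}]`). The function name `top_cost_drivers` and the strict `>` cutoff are illustrative choices, not part of any AWS API. The response is passed in as a plain dict so the logic can be checked without credentials.

```python
from collections import defaultdict


def top_cost_drivers(response: dict, threshold: float = 0.05) -> list:
    """Return (service, cost, share) tuples for services above `threshold` of spend.

    `response` is assumed to have the shape of a Cost Explorer
    GetCostAndUsage result grouped by the SERVICE dimension.
    """
    totals = defaultdict(float)
    # Sum each service's UnblendedCost across every time period returned.
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])

    grand_total = sum(totals.values())
    if grand_total <= 0:
        # $0 (or credit-negative) spend: shares are meaningless, see below.
        return []

    drivers = [
        (service, cost, cost / grand_total)
        for service, cost in totals.items()
        if cost / grand_total > threshold
    ]
    # Largest spenders first.
    return sorted(drivers, key=lambda item: item[1], reverse=True)
```

An empty result on a real account is the cue to switch to the credits path that follows.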

If Credits Mask Costs ($0 spend)

Check if Resource Explorer is enabled:

aws resource-explorer-2 list-indexes --region us-east-1

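If enabled, use it. Run the search in the Region that holds the aggregator index (shown as Type AGGREGATOR in the list-indexes output; us-east-1 below is only an example) to get resources from every indexed Region in one paginated query: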

aws resource-explorer-2 search --query-string "*" --region us-east-1

If NOT enabled, fall back to the Resource Groups Tagging API. It is regional and returns only resources that are (or once were) tagged, so run it in each region you use:

aws resourcegroupstaggingapi get-resources --region us-east-1

Then query each active region for core services: EC2, RDS, EBS, Lambda, S3, ECS, EKS, NAT Gateways, Load Balancers.
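The branching above, plus the "which regions are active" step, can be sketched in Python. This assumes the `list-indexes` output carries an `Indexes` list whose entries have `Region` and `Type` (`LOCAL` or `AGGREGATOR`), and that ARNs are collected from `Resources[].Arn` (Resource Explorer search) or `ResourceTagMappingList[].ResourceARN` (tagging API). `inventory_source` and `active_regions` are hypothetical helper names.

```python
from typing import Optional, Tuple


def inventory_source(list_indexes_response: dict) -> Tuple[str, Optional[str]]:
    """Decide how to enumerate resources from a ListIndexes response.

    ("resource-explorer", region) -> search once in the aggregator's region
    ("resource-explorer", None)   -> only LOCAL indexes; search each region
    ("tagging-api", None)         -> Resource Explorer not enabled
    """
    indexes = list_indexes_response.get("Indexes", [])
    if not indexes:
        return "tagging-api", None
    for index in indexes:
        if index.get("Type") == "AGGREGATOR":
            return "resource-explorer", index["Region"]
    return "resource-explorer", None


def active_regions(arns: list) -> set:
    """Collect the regions that appear in ARNs.

    ARN layout: arn:partition:service:region:account-id:resource.
    Global resources (IAM roles, S3 buckets) have an empty region field
    and are skipped.
    """
    regions = set()
    for arn in arns:
        parts = arn.split(":", 5)
        if len(parts) == 6 and parts[0] == "arn" and parts[3]:
            regions.add(parts[3])
    return regions
```

The resulting region set is the list to loop over for the per-service queries.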

Step 2: Deep Dive Each Resource

For every resource found, gather full details:

EC2: Instance type, state, launch time, CloudWatch CPU (memory metrics require the CloudWatch agent)
RDS: Instance class, connections (14d), storage, Multi-AZ, engine
EBS: Attachment status, volume type, size, snapshots
S3: Lifecycle policies, storage class, versioning
Lambda: Invocations (30d), memory, runtime, provisioned concurrency
ECS/EKS: Task definitions, service counts, cluster utilization
ECR: Repositories, image count, lifecycle policies
Load Balancers: Request count (14d), target groups
NAT Gateway: Data processed
Elastic IPs: Association status
CloudWatch Logs: Retention settings
Secrets Manager: Secret count

Check EVERY resource for optimization opportunities. Don't skip services.
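For the usage metrics in the list above, a small lookup table can turn each resource into CloudWatch `GetMetricStatistics` arguments. This is a sketch: the table and `metric_query` are hypothetical, the RDS, Lambda and load balancer windows follow the list, and the 30-day EC2 window is an assumption borrowed from the guardrail example below. It builds a plain dict, so it runs without AWS access.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical lookup: resource type -> (namespace, metric, dimension name,
# statistic, lookback days).
USAGE_METRICS = {
    "ec2": ("AWS/EC2", "CPUUtilization", "InstanceId", "Average", 30),
    "rds": ("AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "Maximum", 14),
    "lambda": ("AWS/Lambda", "Invocations", "FunctionName", "Sum", 30),
    # ALB dimension values look like "app/<name>/<id>".
    "alb": ("AWS/ApplicationELB", "RequestCount", "LoadBalancer", "Sum", 14),
}


def metric_query(resource_type: str, dimension_value: str,
                 now: Optional[datetime] = None) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    namespace, metric, dimension, statistic, days = USAGE_METRICS[resource_type]
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dimension, "Value": dimension_value}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86400,  # one datapoint per day
        "Statistics": [statistic],
    }
```

Usage with boto3 would look like `boto3.client("cloudwatch").get_metric_statistics(**metric_query("ec2", "i-0abc123def456"))`. RDS uses `Maximum` on purpose: a daily maximum of 0 on every day is direct evidence of zero connections, while an average can hide brief use.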

Step 3: Check Commitment Coverage

Savings Plans utilization
Reserved Instance coverage gaps
Expiring commitments (next 30 days)
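The 30-day expiry check can be sketched as a filter over Reserved Instance records. This assumes records shaped like EC2 `describe_reserved_instances` output (`ReservedInstancesId`, `State`, and `End` as a timezone-aware datetime, which is how boto3 returns it). `expiring_commitments` is a hypothetical name, and Savings Plans would need the same check against their own end dates (not shown).

```python
from datetime import datetime, timedelta, timezone


def expiring_commitments(reservations: list, now: datetime, days: int = 30) -> list:
    """Return IDs of active reservations ending within `days`, soonest first."""
    cutoff = now + timedelta(days=days)
    expiring = [
        r for r in reservations
        if r.get("State") == "active" and now <= r["End"] <= cutoff
    ]
    expiring.sort(key=lambda r: r["End"])
    return [r["ReservedInstancesId"] for r in expiring]
```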

Safety Guardrails

Report findings with evidence, suggest investigation — not direct actions:

"Instance i-xxx averaged 3% CPU over 30 days — rightsizing candidate"
"Volume vol-xxx unattached since [date] — verify before removing"
"RDS db-xxx had 0 connections for 14 days — confirm if still needed"

Thresholds:

Idle: ~0% utilization for 14+ days
Underutilized: <10% average for 14+ days
Rightsizing candidate: <30% average
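The thresholds translate directly into a classifier. In this sketch "~0%" is treated as an average below 1%, which is an assumption, not a figure from the thresholds; the 14-day window is applied to Idle and Underutilized as stated.

```python
from typing import Optional

MIN_DAYS = 14            # observation window from the thresholds above
IDLE_MAX_PCT = 1.0       # assumption: "~0%" means below 1% average
UNDERUTILIZED_PCT = 10.0
RIGHTSIZING_PCT = 30.0


def classify_utilization(avg_pct: float, days_observed: int) -> Optional[str]:
    """Return the threshold label for an average utilization, or None."""
    enough_data = days_observed >= MIN_DAYS
    if enough_data and avg_pct < IDLE_MAX_PCT:
        return "Idle"
    if enough_data and avg_pct < UNDERUTILIZED_PCT:
        return "Underutilized"
    if avg_pct < RIGHTSIZING_PCT:
        return "Rightsizing candidate"
    return None
```

No minimum window is given for the rightsizing tier, so a short-lived resource can still land there; gating that branch on `enough_data` as well is the more conservative choice.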

Smart Recommendation Rules

Only flag when action is possible:

Situation → Action

Minimum size + in use (db.t3.micro with connections) → Skip; already right-sized
Minimum size + idle (db.t3.micro, 0 connections) → Flag as idle
Larger size + low utilization → Flag for rightsizing with a specific target
Tagged FinOps:Skip=true → Skip
Dev/staging with Environment=dev → Skip low utilization (expected)

Before flagging, verify:

Is this the minimum size?
Is it actually in use? (connections/invocations/requests)
Is there a smaller option?
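The rules and pre-flag checks can be expressed as one decision function. `recommend` and its input record (keys `tags`, `in_use`, `low_utilization`, `is_minimum_size`, `smaller_option`) are hypothetical; real code would fill them from the Step 2 data. Returning `None` means skip.

```python
from typing import Optional

DEV_ENVIRONMENTS = {"dev"}  # extend (e.g. "staging") if your tags differ


def recommend(resource: dict) -> Optional[str]:
    """Apply the situation/action rules; None means no finding."""
    tags = resource.get("tags", {})
    if tags.get("FinOps:Skip") == "true":
        return None
    # Idle resources are flagged at any size, including minimum size.
    if not resource["in_use"]:
        return "Flag as idle"
    if not resource["low_utilization"]:
        return None
    # Low utilization is expected in dev environments.
    if tags.get("Environment") in DEV_ENVIRONMENTS:
        return None
    # Already at minimum size, or nothing smaller to move to.
    if resource["is_minimum_size"] or resource.get("smaller_option") is None:
        return None
    return f"Rightsizing to {resource['smaller_option']}"
```

The idle check runs before the dev check, so an idle dev resource is still flagged; only its low utilization is excused. Swap the order if dev resources should be skipped entirely.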

Service Checklists

EC2: Utilization, stopped instances (attached EBS volumes still bill), previous-gen types, On-Demand 24/7 → SP/RI

Lambda: Zero invocations (30d), memory vs duration tradeoff, provisioned concurrency

ECS/EKS: Fargate vs EC2, resource requests vs usage, Spot for fault-tolerant workloads

ECR: Lifecycle policies, image count, total size — old images accumulate

RDS: Connection count, Multi-AZ in dev, instance class utilization, storage, previous-gen

DynamoDB: Provisioned vs On-Demand fit, auto-scaling, TTL

ElastiCache/OpenSearch: Node utilization, reserved coverage

S3: Lifecycle policies, storage class, Intelligent-Tiering, incomplete multipart uploads

EBS: Unattached volumes, gp2→gp3, snapshot retention, IOPS necessity

Networking: Cross-AZ transfer, NAT Gateway → VPC endpoints, CloudFront caching

Load Balancers: Zero requests = likely orphaned, Classic→ALB/NLB

Elastic IPs: Unassociated = $3.60/month each ($0.005/hour × 720 hours)

CloudWatch: Log retention (default: never expire), high-res metrics necessity

Secrets Manager: $0.40 per secret per month (plus API calls) vs free Parameter Store standard parameters

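API Gateway: HTTP APIs cost up to ~70% less than REST APIs ($1.00 vs $3.50 per million requests at the first us-east-1 tier) when REST-only features such as usage plans and API keys aren't needed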
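The dollar figures in the checklist are simple unit-price arithmetic. The sketch below assumes us-east-1 list prices (public IPv4 $0.005/hour, gp2 $0.10 and gp3 $0.08 per GB-month, Secrets Manager $0.40 per secret-month) and a 720-hour month; check current pricing for your region before putting numbers in the report. The helper names are hypothetical.

```python
HOURS_PER_MONTH = 720  # 30-day month, matches the $3.60 Elastic IP figure

# Assumed us-east-1 list prices in USD.
EIP_PER_HOUR = 0.005
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08
SECRET_PER_MONTH = 0.40


def idle_eip_monthly(count: int) -> float:
    """Monthly cost of unassociated Elastic IPs."""
    return round(count * EIP_PER_HOUR * HOURS_PER_MONTH, 2)


def gp2_to_gp3_monthly(size_gb: int) -> float:
    """Storage-price savings only; gp3 includes a 3,000 IOPS / 125 MiB/s baseline."""
    return round(size_gb * (GP2_PER_GB_MONTH - GP3_PER_GB_MONTH), 2)


def secrets_to_parameter_store_monthly(secret_count: int) -> float:
    """Secret storage cost avoided; standard parameters have no storage charge."""
    return round(secret_count * SECRET_PER_MONTH, 2)
```

For gp2 volumes above 1,000 GiB (gp2 gives 3 IOPS per GiB, so more than 3,000 baseline IOPS), matching performance on gp3 means paying for extra IOPS, so the real saving is smaller than `gp2_to_gp3_monthly` reports.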

Output Requirements

CSV (Required)

account_id,resource_name,status,recommendation_type,potential_savings_monthly,resource_id,region,resource_type,tags,description
123456789012,web-server-prod,Underutilized,Rightsizing to t3.small,45.00,i-0abc123def456,us-east-1,EC2 Instance,"Environment=prod,Team=platform","Avg CPU 8% over 30 days. Current: t3.large"
123456789012,,Unattached,Verify before removing,12.50,vol-0xyz789,us-east-1,EBS Volume,,"100GB gp2 volume unattached since 2024-12-01"
123456789012,raspberry,No-Lifecycle,Add ECR lifecycle policy,2.00,raspberry,us-east-1,ECR Repository,,"47 images totaling 12GB. No lifecycle policy configured"

Column: Description

account_id: AWS account ID
resource_name: Name tag value (empty if untagged)
status: Idle, Underutilized, Oversized, Unattached, Previous-Gen, No-Lifecycle, Uncovered
recommendation_type: Specific action, e.g. Rightsizing to X, Verify before removing, Add ECR lifecycle policy, Switch to gp3, Consider SP/RI
potential_savings_monthly: USD/month (conservative estimate)
resource_id: AWS resource ID
region: AWS region
resource_type: EC2 Instance, EBS Volume, RDS Instance, ECR Repository, S3 Bucket, etc.
tags: Resource tags as comma-separated Key=Value pairs, quoted (empty if untagged)
description: Evidence for the finding (metrics, dates, current configuration)
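A sketch of writing findings in this exact column order with Python's standard `csv` module. The finding dicts, `write_findings`, `format_tags`, and `report_filename` are hypothetical, and the date is assumed to be ISO `YYYY-MM-DD`. `csv.DictWriter` quotes the tags field automatically because it contains commas.

```python
import csv
import io

COLUMNS = [
    "account_id", "resource_name", "status", "recommendation_type",
    "potential_savings_monthly", "resource_id", "region", "resource_type",
    "tags", "description",
]


def format_tags(tags: dict) -> str:
    """Render tags as comma-separated Key=Value pairs."""
    return ",".join(f"{key}={value}" for key, value in tags.items())


def write_findings(findings: list, out) -> None:
    """Write the header plus one row per finding to a text file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for finding in findings:
        row = dict(finding)
        row["tags"] = format_tags(finding.get("tags", {}))
        row["potential_savings_monthly"] = f"{finding['potential_savings_monthly']:.2f}"
        writer.writerow(row)  # missing columns (e.g. resource_name) become ""


def report_filename(account_id: str, date: str) -> str:
    return f"finops-report-{account_id}-{date}.csv"
```

When writing to disk, open the file with `newline=""` (for example `open(report_filename(acct, day), "w", newline="")`) so the `csv` module controls line endings.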

AWS FinOps Summary

Account: [id] | Date: [date]

Spend Overview

Monthly spend: $X,XXX
Top spenders: [service1] (X%), [service2] (X%), [service3] (X%)

Findings

Total potential savings: $X,XXX/month
Resources flagged: X

Top 5 Opportunities

[resource] - [recommendation] - $X/month

Next Steps

[investigation recommendations]

Save as finops-report-[account-id]-[date].csv
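The summary template can be filled from the same findings. In this sketch `render_summary` is hypothetical, findings are dicts with `resource_id`, optional `resource_name`, `recommendation_type`, and numeric `potential_savings_monthly`, and top spenders are (service, percent) pairs such as those from the Step 1 filter. Next Steps is left out because it is free-form judgment.

```python
def render_summary(account_id: str, date: str, monthly_spend: float,
                   top_spenders: list, findings: list) -> str:
    """Fill the summary template; Top 5 is ranked by monthly savings."""
    total = sum(item["potential_savings_monthly"] for item in findings)
    top5 = sorted(findings, key=lambda item: item["potential_savings_monthly"],
                  reverse=True)[:5]
    spenders = ", ".join(f"{name} ({pct:.0f}%)" for name, pct in top_spenders[:3])
    lines = [
        "AWS FinOps Summary",
        f"Account: {account_id} | Date: {date}",
        "",
        "Spend Overview",
        f"Monthly spend: ${monthly_spend:,.2f}",
        f"Top spenders: {spenders}",
        "",
        "Findings",
        f"Total potential savings: ${total:,.2f}/month",
        f"Resources flagged: {len(findings)}",
        "",
        "Top 5 Opportunities",
    ]
    for item in top5:
        # Prefer the Name tag; fall back to the resource ID.
        label = item.get("resource_name") or item["resource_id"]
        lines.append(f"{label} - {item['recommendation_type']} - "
                     f"${item['potential_savings_monthly']:,.2f}/month")
    return "\n".join(lines)
```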
