AWS Monitoring Skill
This skill helps you monitor and debug AWS resources for the SG Cars Trends platform.
When to Use This Skill
-
Investigating production errors
-
Checking Lambda function logs
-
Monitoring API performance
-
Debugging deployment failures
-
Analyzing CloudWatch metrics
-
Setting up alarms
-
Troubleshooting resource issues
Monitoring Tools
SST Console
SST provides a built-in console for monitoring:
Open SST console for specific stage
npx sst console --stage production npx sst console --stage staging npx sst console --stage dev
Features:
-
Real-time Lambda logs
-
Function invocations
-
Error tracking
-
Resource overview
-
Environment variables
CloudWatch Logs
Access Lambda logs via CloudWatch:
View logs using SST
npx sst logs --stage production
View specific function logs
npx sst logs --stage production --function api
Tail logs in real-time
npx sst logs --stage production --function api --tail
Filter logs
npx sst logs --stage production --function api --filter "ERROR"
Show logs from specific time
npx sst logs --stage production --function api --since 1h npx sst logs --stage production --function api --since "2024-01-15 10:00"
AWS CLI
Use AWS CLI for advanced log queries:
List log groups
aws logs describe-log-groups
--log-group-name-prefix "/aws/lambda/sgcarstrends"
Get recent log streams
aws logs describe-log-streams
--log-group-name "/aws/lambda/sgcarstrends-api-production"
--order-by LastEventTime
--descending
--max-items 5
Tail logs
aws logs tail "/aws/lambda/sgcarstrends-api-production" --follow
Filter logs
aws logs filter-log-events
--log-group-name "/aws/lambda/sgcarstrends-api-production"
--filter-pattern "ERROR"
--start-time $(date -u -d '1 hour ago' +%s)000
Get logs for specific request
aws logs filter-log-events
--log-group-name "/aws/lambda/sgcarstrends-api-production"
--filter-pattern "request-id-here"
CloudWatch Metrics
Lambda Metrics
Get Lambda invocations
aws cloudwatch get-metric-statistics
--namespace AWS/Lambda
--metric-name Invocations
--dimensions Name=FunctionName,Value=sgcarstrends-api-production
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Sum
Get errors
aws cloudwatch get-metric-statistics
--namespace AWS/Lambda
--metric-name Errors
--dimensions Name=FunctionName,Value=sgcarstrends-api-production
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Sum
Get duration
aws cloudwatch get-metric-statistics
--namespace AWS/Lambda
--metric-name Duration
--dimensions Name=FunctionName,Value=sgcarstrends-api-production
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Average,Maximum
API Gateway Metrics
Get API requests
aws cloudwatch get-metric-statistics
--namespace AWS/ApiGateway
--metric-name Count
--dimensions Name=ApiName,Value=sgcarstrends-api
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Sum
Get 4XX errors
aws cloudwatch get-metric-statistics
--namespace AWS/ApiGateway
--metric-name 4XXError
--dimensions Name=ApiName,Value=sgcarstrends-api
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Sum
Get latency
aws cloudwatch get-metric-statistics
--namespace AWS/ApiGateway
--metric-name Latency
--dimensions Name=ApiName,Value=sgcarstrends-api
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Average,Maximum,p99
CloudWatch Alarms
Creating Alarms
// infra/alarms.ts import { StackContext, use } from "sst/constructs"; import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch"; import * as sns from "aws-cdk-lib/aws-sns"; import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions"; import { API } from "./api";
export function Alarms({ stack, app }: StackContext) { const { api } = use(API);
// Only create alarms for production if (app.stage !== "production") { return; }
// SNS topic for alarms const alarmTopic = new sns.Topic(stack, "AlarmTopic");
// Add email subscription alarmTopic.addSubscription( new subscriptions.EmailSubscription("alerts@sgcarstrends.com") );
// High error rate alarm new cloudwatch.Alarm(stack, "ApiHighErrorRate", { metric: api.metricErrors(), threshold: 10, evaluationPeriods: 2, datapointsToAlarm: 2, alarmDescription: "API has high error rate", treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING, }).addAlarmAction(new cloudwatch.SnsAction(alarmTopic));
// High duration alarm new cloudwatch.Alarm(stack, "ApiHighDuration", { metric: api.metricDuration(), threshold: 5000, // 5 seconds evaluationPeriods: 2, datapointsToAlarm: 2, alarmDescription: "API response time is high", treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING, }).addAlarmAction(new cloudwatch.SnsAction(alarmTopic));
// Throttle alarm new cloudwatch.Alarm(stack, "ApiThrottled", { metric: api.metricThrottles(), threshold: 1, evaluationPeriods: 1, alarmDescription: "API is being throttled", treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING, }).addAlarmAction(new cloudwatch.SnsAction(alarmTopic)); }
Add to SST config:
// infra/sst.config.ts import { Alarms } from "./alarms";
export default { stacks(app) { app .stack(DNS) .stack(API) .stack(Web) .stack(Alarms); // Add alarms stack }, } satisfies SSTConfig;
Managing Alarms via CLI
List alarms
aws cloudwatch describe-alarms
Get alarm state
aws cloudwatch describe-alarms
--alarm-names "sgcarstrends-ApiHighErrorRate"
Disable alarm
aws cloudwatch disable-alarm-actions
--alarm-names "sgcarstrends-ApiHighErrorRate"
Enable alarm
aws cloudwatch enable-alarm-actions
--alarm-names "sgcarstrends-ApiHighErrorRate"
Delete alarm
aws cloudwatch delete-alarms
--alarm-names "sgcarstrends-ApiHighErrorRate"
CloudWatch Insights
Querying Logs
Start query
aws logs start-query
--log-group-name "/aws/lambda/sgcarstrends-api-production"
--start-time $(date -u -d '1 hour ago' +%s)
--end-time $(date -u +%s)
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
Get query results
aws logs get-query-results --query-id <query-id>
Common Queries
Find errors:
fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
API performance:
fields @timestamp, @duration | stats avg(@duration), max(@duration), min(@duration)
Count errors by type:
fields @message | filter @message like /ERROR/ | parse @message /(?<errorType>\w+Error)/ | stats count() by errorType
Slow requests:
fields @timestamp, @duration, @requestId | filter @duration > 1000 | sort @duration desc | limit 20
Request rate:
fields @timestamp | stats count() by bin(5m)
X-Ray Tracing
Enable X-Ray
// infra/api.ts import { StackContext, Function } from "sst/constructs"; import * as lambda from "aws-cdk-lib/aws-lambda";
export function API({ stack }: StackContext) { const api = new Function(stack, "api", { handler: "apps/api/src/index.handler", tracing: lambda.Tracing.ACTIVE, // Enable X-Ray });
return { api }; }
Instrument Code
// apps/api/src/index.ts import { captureAWSv3Client } from "aws-xray-sdk-core"; import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
// Wrap AWS SDK clients const client = captureAWSv3Client(new DynamoDBClient({}));
View Traces
Get service graph
aws xray get-service-graph
--start-time $(date -u -d '1 hour ago' +%s)
--end-time $(date -u +%s)
Get trace summaries
aws xray get-trace-summaries
--start-time $(date -u -d '1 hour ago' +%s)
--end-time $(date -u +%s)
Get trace details
aws xray batch-get-traces --trace-ids <trace-id>
Resource Monitoring
Lambda Functions
List functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, sgcarstrends)].FunctionName'
Get function config
aws lambda get-function-configuration
--function-name sgcarstrends-api-production
Get function code location
aws lambda get-function
--function-name sgcarstrends-api-production
Invoke function
aws lambda invoke
--function-name sgcarstrends-api-production
--payload '{"path": "/health"}'
response.json
cat response.json
CloudFront Distributions
List distributions
aws cloudfront list-distributions
--query 'DistributionList.Items[*].[Id,DomainName,Status]'
--output table
Get distribution config
aws cloudfront get-distribution-config --id <distribution-id>
Create invalidation (cache clear)
aws cloudfront create-invalidation
--distribution-id <distribution-id>
--paths "/*"
List invalidations
aws cloudfront list-invalidations --distribution-id <distribution-id>
S3 Buckets
List buckets
aws s3 ls
Get bucket size
aws s3 ls s3://bucket-name --recursive --summarize | grep "Total Size"
Monitor bucket metrics
aws cloudwatch get-metric-statistics
--namespace AWS/S3
--metric-name BucketSizeBytes
--dimensions Name=BucketName,Value=bucket-name Name=StorageType,Value=StandardStorage
--start-time $(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 86400
--statistics Average
Cost Monitoring
Cost Explorer
Get cost and usage
aws ce get-cost-and-usage
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d)
--granularity MONTHLY
--metrics BlendedCost
--group-by Type=SERVICE
Get cost by tag
aws ce get-cost-and-usage
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d)
--granularity MONTHLY
--metrics BlendedCost
--group-by Type=TAG,Key=Environment
Budget Alerts
Create budget in AWS Console or via CLI:
Create budget
aws budgets create-budget
--account-id $(aws sts get-caller-identity --query Account --output text)
--budget file://budget.json
--notifications-with-subscribers file://notifications.json
Debugging Production Issues
- Check Recent Deployments
Get stack events
aws cloudformation describe-stack-events
--stack-name sgcarstrends-api-production
--max-items 50
Get deployment status
npx sst stacks info API --stage production
- Check Logs for Errors
Get recent errors
npx sst logs --stage production --function api --filter "ERROR" --since 1h
Or use AWS CLI
aws logs tail "/aws/lambda/sgcarstrends-api-production"
--follow
--filter-pattern "ERROR"
- Check Metrics
Check invocations and errors
aws cloudwatch get-metric-statistics
--namespace AWS/Lambda
--metric-name Invocations
--dimensions Name=FunctionName,Value=sgcarstrends-api-production
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
--end-time $(date -u +%Y-%m-%dT%H:%M:%S)
--period 300
--statistics Sum
- Test Endpoint
Test API directly
curl -I https://api.sgcarstrends.com/health
Test with verbose output
curl -v https://api.sgcarstrends.com/health
- Check Resource Limits
Check Lambda quotas
aws service-quotas get-service-quota
--service-code lambda
--quota-code L-B99A9384 # Concurrent executions
Check API Gateway quotas
aws service-quotas list-service-quotas
--service-code apigateway
Common Issues
High Latency
Investigation:
-
Check Lambda duration metrics
-
Review CloudWatch Insights for slow queries
-
Check database connection pool
-
Review API response times
Solutions:
-
Increase Lambda memory
-
Optimize database queries
-
Add caching
-
Use connection pooling
High Error Rate
Investigation:
-
Check error logs
-
Review error types
-
Check external service status
-
Verify environment variables
Solutions:
-
Fix application bugs
-
Add error handling
-
Retry failed requests
-
Check API rate limits
Cold Starts
Investigation:
-
Check init duration
-
Review package size
-
Check provisioned concurrency
Solutions:
-
Enable provisioned concurrency
-
Reduce bundle size
-
Use ARM architecture
-
Optimize imports
Monitoring Scripts
Health Check Script
#!/bin/bash
scripts/health-check.sh
STAGE=${1:-production} API_URL="https://api${STAGE:+.$STAGE}.sgcarstrends.com"
echo "Checking health of $STAGE environment..."
Check API
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $API_URL/health)
if [ $API_STATUS -eq 200 ]; then echo "✓ API is healthy" else echo "✗ API is down (status: $API_STATUS)" exit 1 fi
Check Web
WEB_URL="https://${STAGE:+$STAGE.}sgcarstrends.com" WEB_STATUS=$(curl -s -o /dev/null -w "%{http_code}" $WEB_URL)
if [ $WEB_STATUS -eq 200 ]; then echo "✓ Web is healthy" else echo "✗ Web is down (status: $WEB_STATUS)" exit 1 fi
echo "All services are healthy!"
Run:
chmod +x scripts/health-check.sh ./scripts/health-check.sh production
Log Analysis Script
#!/bin/bash
scripts/analyze-logs.sh
STAGE=${1:-production} LOG_GROUP="/aws/lambda/sgcarstrends-api-$STAGE"
echo "Analyzing logs for $STAGE..."
Count errors in last hour
ERROR_COUNT=$(aws logs filter-log-events
--log-group-name $LOG_GROUP
--filter-pattern "ERROR"
--start-time $(date -u -d '1 hour ago' +%s)000
--query 'events[*].message'
--output text | wc -l)
echo "Errors in last hour: $ERROR_COUNT"
Get top errors
echo -e "\nTop error types:"
aws logs filter-log-events
--log-group-name $LOG_GROUP
--filter-pattern "ERROR"
--start-time $(date -u -d '1 hour ago' +%s)000
--query 'events[*].message'
--output text |
grep -oE '\w+Error' |
sort | uniq -c | sort -rn | head -5
References
-
CloudWatch Documentation: https://docs.aws.amazon.com/cloudwatch
-
Lambda Monitoring: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions.html
-
Related files:
-
infra/
-
Infrastructure with monitoring config
-
Root CLAUDE.md - Project documentation
Best Practices
-
Log Levels: Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
-
Structured Logging: Use JSON format for easier parsing
-
Correlation IDs: Track requests across services
-
Alarms: Set up alarms for critical metrics
-
Dashboards: Create CloudWatch dashboards for key metrics
-
Cost Monitoring: Track AWS costs regularly
-
Regular Reviews: Review logs and metrics weekly
-
Retention: Set appropriate log retention (7-30 days)