ecs-troubleshooting

ECS troubleshooting and debugging guide covering task failures, service issues, networking problems, and performance diagnostics. Use when diagnosing ECS issues, debugging task failures (STOPPED, PENDING), resolving networking problems, investigating IAM/permissions errors, troubleshooting container health checks, or analyzing ECS service health.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "ecs-troubleshooting" with this command: npx skills add adaptationio/skrillz/adaptationio-skrillz-ecs-troubleshooting

ECS Troubleshooting Guide

Complete guide to diagnosing and resolving common ECS issues.

Quick Diagnostic Commands

# Check service status
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'

# List stopped tasks (failures)
aws ecs list-tasks \
  --cluster production \
  --service-name my-service \
  --desired-status STOPPED

# Describe stopped task
aws ecs describe-tasks \
  --cluster production \
  --tasks <task-arn> \
  --query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# View recent logs
aws logs tail /ecs/my-app --since 1h --follow

# Execute into container (debug)
aws ecs execute-command \
  --cluster production \
  --task <task-id> \
  --container my-app \
  --interactive \
  --command "/bin/sh"

Task Failures

Task Status: STOPPED

Symptom

Tasks stop immediately after starting, or fail to start at all.

Diagnostic Steps

import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""

    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]

    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")

    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f"  Status: {container['lastStatus']}")
        print(f"  Exit Code: {container.get('exitCode', 'N/A')}")
        print(f"  Reason: {container.get('reason', 'N/A')}")

Common Causes & Solutions

1. Essential container failed

stoppedReason: "Essential container in task exited"

Solution: Check container logs for application errors

aws logs tail /ecs/my-app --since 30m

2. Task failed to start

stoppedReason: "Task failed to start"

Solution: Check execution role permissions

# Verify the execution role has a policy attached that allows pulling the image
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole

3. CannotPullContainerError

reason: "CannotPullContainerError: Error response from daemon"

Solutions:

  • Check ECR permissions in execution role
  • Verify image exists: aws ecr describe-images --repository-name my-app
  • Check VPC endpoints or NAT gateway for private subnets

4. OutOfMemoryError

reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137

Solution: Increase memory in task definition

memory = 2048  # Increase from current value

5. Exit Code 1 (Application Error)

exitCode: 1

Solution: Check application logs for errors

aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "ERROR"
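
The causes above can be folded into a small triage helper. The sketch below is a minimal illustration, not part of any AWS SDK: `explain_exit_code` and `classify_stopped_task` are hypothetical names, and the input is assumed to be a task dict shaped like the `describe_tasks` output used earlier. It relies on the shell convention that an exit code above 128 means "128 + signal number", which is why an OOM kill shows up as 137 (128 + SIGKILL).

```python
import signal


def explain_exit_code(exit_code):
    """Turn a container exit code into a short hint (None: container never ran)."""
    if exit_code is None:
        return "no exit code (container never started)"
    if exit_code == 0:
        return "exited cleanly"
    if exit_code > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = exit_code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by signal {signum}"
    return f"application error (exit code {exit_code})"


def classify_stopped_task(task: dict) -> list:
    """Return one hint per container for a describe_tasks-style task dict."""
    hints = []
    for container in task.get("containers", []):
        reason = container.get("reason", "")
        if reason.startswith("CannotPullContainerError"):
            hint = "image pull failed: check ECR permissions, image tag, network path"
        elif reason.startswith("OutOfMemoryError"):
            hint = "out of memory: raise the task memory or reduce usage"
        else:
            hint = explain_exit_code(container.get("exitCode"))
        hints.append(f"{container.get('name', '?')}: {hint}")
    return hints


# Hypothetical sample data for illustration only.
sample_task = {
    "containers": [
        {"name": "my-app", "exitCode": 137,
         "reason": "OutOfMemoryError: Container killed due to memory usage"},
        {"name": "sidecar", "exitCode": 143},
        {"name": "init",
         "reason": "CannotPullContainerError: Error response from daemon"},
    ]
}

for line in classify_stopped_task(sample_task):
    print(line)
```

The signal names come from the local platform's `signal` module, so the helper is best run on Linux or macOS.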

Task Status: PENDING

Symptom

Tasks stuck in PENDING state, not transitioning to RUNNING.

Diagnostic Steps

def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""

    # A PENDING task still has desiredStatus RUNNING, so list by that
    # and filter on lastStatus below
    pending = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )

    for task_arn in pending['taskArns']:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]

        if task['lastStatus'] == 'PENDING':
            print(f"Task {task_arn.split('/')[-1]} is PENDING")

            # Check attachments for ENI issues
            for attachment in task.get('attachments', []):
                print(f"  Attachment: {attachment['type']} - {attachment['status']}")
                for detail in attachment.get('details', []):
                    print(f"    {detail['name']}: {detail['value']}")

Common Causes & Solutions

1. No available capacity

Service my-service was unable to place a task because no container instance met all of its requirements

Solutions for Fargate:

  • Check capacity provider limits
  • Verify subnet has available IPs
  • Check if region/AZ has Fargate capacity

2. ENI provisioning issues

Attachment status: PRECREATED

Solutions:

  • Check security group allows required traffic
  • Verify subnet has available IPs
  • Check ENI limits for EC2 instances

3. Image pull taking too long

Container image: pulling

Solutions:

  • Check image size (use smaller base images)
  • Verify network connectivity to ECR
  • Use VPC endpoints for faster pulls
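
"Verify subnet has available IPs" appears in two of the causes above, so it is worth scripting. The sketch below assumes subnet dicts shaped like the `Subnets` list returned by EC2 `describe_subnets`; the helper names, the sample subnet IDs and the threshold of 5 free addresses are illustrative assumptions, not recommendations.

```python
def subnets_low_on_ips(subnets: list, min_free: int = 5) -> list:
    """Return the IDs of subnets with fewer than min_free free IP addresses."""
    return [
        subnet["SubnetId"]
        for subnet in subnets
        if subnet["AvailableIpAddressCount"] < min_free
    ]


def fetch_subnets(subnet_ids: list) -> list:
    """Fetch subnet descriptions from EC2 (requires AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works offline

    ec2 = boto3.client("ec2")
    return ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]


# Hypothetical sample data for illustration only.
sample_subnets = [
    {"SubnetId": "subnet-aaa", "AvailableIpAddressCount": 2},
    {"SubnetId": "subnet-bbb", "AvailableIpAddressCount": 120},
]

print(subnets_low_on_ips(sample_subnets))
```

With real credentials, `subnets_low_on_ips(fetch_subnets([...]))` would run the same check against the subnets from the service's network configuration. Each awsvpc task consumes one IP address from its subnet.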

Service Issues

Service Not Starting Tasks

Diagnostic

# Check service events
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].events[:10]'

Common Events & Solutions

1. "service my-service is unable to place a task"

Check task placement constraints and capacity.

2. "service my-service has reached a steady state"

Service is healthy - tasks are running as expected.

3. "service my-service was unable to place a task because no container instance met all requirements"

For Fargate: Check CPU/memory configurations are valid combinations.
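
Fargate only accepts specific CPU/memory pairs. The sketch below encodes the combinations as I understand them from the AWS task size table (CPU in units, memory in MiB); treat the numbers as an assumption to verify against the current documentation. The function names are hypothetical.

```python
def fargate_memory_options(cpu: int) -> list:
    """Return the memory values (MiB) Fargate accepts for a CPU value (units)."""
    if cpu == 256:
        return [512, 1024, 2048]
    # cpu -> (minimum MiB, maximum MiB, increment MiB)
    ranges = {
        512: (1024, 4096, 1024),
        1024: (2048, 8192, 1024),
        2048: (4096, 16384, 1024),
        4096: (8192, 30720, 1024),
        8192: (16384, 61440, 4096),
        16384: (32768, 122880, 8192),
    }
    if cpu not in ranges:
        return []
    low, high, step = ranges[cpu]
    return list(range(low, high + 1, step))


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Check whether a CPU/memory pair is an accepted Fargate task size."""
    return memory in fargate_memory_options(cpu)


print(is_valid_fargate_size(512, 1024))
print(is_valid_fargate_size(256, 4096))
```

Task definitions store `cpu` and `memory` as strings, so convert with `int()` before calling the helper.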

Deployment Stuck

Symptom

Deployment never reaches COMPLETED state.

Diagnostic

def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""

    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]

    for deployment in svc['deployments']:
        print(f"\nDeployment: {deployment['id']}")
        print(f"  Status: {deployment['status']}")
        print(f"  Rollout State: {deployment.get('rolloutState', 'N/A')}")
        print(f"  Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")

        # rolloutStateReason explains both in-progress and failed rollouts
        reason = deployment.get('rolloutStateReason')
        if reason:
            print(f"  Reason: {reason}")

Common Causes

1. Health check failures

rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"

Solutions:

  • Check target group health check settings
  • Increase healthCheckGracePeriodSeconds
  • Verify application responds on health check path

2. Insufficient capacity

rolloutStateReason: "Service my-service was unable to place a task"

Solutions:

  • Check subnet IP availability
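  • Lower maximumPercent so the deployment launches fewer extra tasks at once, or lower minimumHealthyPercent so old tasks can stop before new ones start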
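
The capacity a rolling deployment needs follows from desiredCount and the two deployment percentages. The sketch below is a rough calculator under the assumption, taken from my reading of the ECS documentation, that the minimum healthy count rounds up and the maximum count rounds down; `deployment_task_bounds` is a hypothetical helper name.

```python
def deployment_task_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple:
    """Return (minimum running tasks, maximum total tasks) during a rolling deploy."""
    # Minimum healthy tasks: percentage of desiredCount, rounded up.
    min_running = (desired * min_healthy_pct + 99) // 100
    # Maximum tasks (old + new): percentage of desiredCount, rounded down.
    max_total = desired * max_pct // 100
    return min_running, max_total


for settings in [(4, 100, 200), (4, 50, 150), (3, 50, 100)]:
    print(settings, "->", deployment_task_bounds(*settings))
```

With 4 desired tasks at 100%/200%, the deployment needs room for up to 8 tasks at once. With 3 desired tasks at 50%/100%, it never exceeds 3 tasks but may drop to 2 while replacing them.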

Networking Issues

Tasks Cannot Connect to Internet

Symptoms

  • Cannot pull images
  • Cannot reach external APIs
  • Timeouts on external calls

Solutions

For private subnets:

# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

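# Option 2: VPC Endpoints (recommended for AWS service traffic such as image pulls;
# they do not provide general internet access). Pulling from ECR also needs the
# ecr.dkr interface endpoint and an S3 gateway endpoint.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  private_dns_enabled = true
}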

Tasks Cannot Connect to Each Other

Symptom

Service-to-service communication fails.

Diagnostic

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids sg-12345 \
  --query 'SecurityGroups[0].IpPermissions'

Solutions

# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.ecs_tasks.id
}

Load Balancer Health Checks Failing

Symptom

Target group app-tg: 0 healthy, 3 unhealthy

Diagnostic

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>
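
The target health output includes a reason code for every unhealthy target, which usually points at the cause faster than the counts do. The sketch below summarizes a list shaped like the `TargetHealthDescriptions` field of the `describe_target_health` response; the helper name and the sample IPs are hypothetical.

```python
from collections import Counter


def summarize_target_health(descriptions: list) -> tuple:
    """Return (count per state, one line per target that is not healthy)."""
    counts = Counter(d["TargetHealth"]["State"] for d in descriptions)
    problems = []
    for d in descriptions:
        health = d["TargetHealth"]
        if health["State"] != "healthy":
            target = d["Target"]
            problems.append(
                f"{target['Id']}:{target.get('Port', '?')} "
                f"{health.get('Reason', 'no reason')} "
                f"({health.get('Description', '')})"
            )
    return dict(counts), problems


# Hypothetical sample data for illustration only.
sample = [
    {"Target": {"Id": "10.0.1.15", "Port": 8080},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "10.0.2.27", "Port": 8080},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.ResponseCodeMismatch",
                      "Description": "Health checks failed with these codes: [404]"}},
    {"Target": {"Id": "10.0.3.9", "Port": 8080},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.Timeout",
                      "Description": "Request timed out"}},
]

counts, problems = summarize_target_health(sample)
print(counts)
for line in problems:
    print(line)
```

A response-code mismatch typically points at the health check path, while a timeout points at the port or a security group.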

Common Causes & Solutions

1. Wrong health check path

health_check {
  path = "/health"  # Must match application endpoint
}

2. Container not listening on expected port

# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
  --container my-app --interactive --command "netstat -tlnp"

3. Security group blocking ALB

# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.alb.id
}
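
Whether a rule like the one above is actually in place can be checked from the `IpPermissions` list returned by `describe-security-groups`. The sketch below is a simplified check with a hypothetical helper name: it only looks at rules that reference a source security group, ignores CIDR-based rules, and the group IDs in the sample are made up.

```python
def sg_allows(ip_permissions: list, port: int, source_group_id: str,
              protocol: str = "tcp") -> bool:
    """Return True if an ingress rule admits source_group_id on the given port."""
    for perm in ip_permissions:
        rule_protocol = perm.get("IpProtocol")
        if rule_protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif rule_protocol == protocol:
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in sources:
            return True
    return False


# Hypothetical sample data for illustration only.
sample_permissions = [
    {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
     "UserIdGroupPairs": [{"GroupId": "sg-alb"}]},
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

print(sg_allows(sample_permissions, 8080, "sg-alb"))
print(sg_allows(sample_permissions, 8080, "sg-other"))
```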

IAM & Permissions Issues

CannotPullContainerError

Symptom

CannotPullContainerError: Error response from daemon: pull access denied

Solution: Task Execution Role

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# For cross-account ECR: the repository policy in the other account must also
# allow this role to pull
resource "aws_iam_role_policy" "cross_account_ecr" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
    }]
  })
}

Secrets Access Denied

Symptom

ResourceInitializationError: unable to pull secrets

Solution

resource "aws_iam_role_policy" "secrets_access" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["ssm:GetParameters"]
        Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["kms:Decrypt"]
        Resource = aws_kms_key.secrets.arn
      }
    ]
  })
}
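
To scope the policy above correctly, it helps to list exactly which secrets a task definition references. The sketch below walks a dict shaped like the `taskDefinition` field returned by `describe_task_definition` and groups each `valueFrom` by service. The helper name and sample ARNs are hypothetical, and anything that is not a Secrets Manager ARN is assumed to be an SSM parameter name or ARN.

```python
def secret_references(task_definition: dict) -> dict:
    """Group the valueFrom entries of all container secrets by backing service."""
    refs = {"secretsmanager": [], "ssm": []}
    for container in task_definition.get("containerDefinitions", []):
        for secret in container.get("secrets", []):
            value_from = secret["valueFrom"]
            if ":secretsmanager:" in value_from:
                refs["secretsmanager"].append(value_from)
            else:
                # Assumption: everything else is an SSM parameter name or ARN.
                refs["ssm"].append(value_from)
    return refs


# Hypothetical sample data for illustration only.
sample_task_definition = {
    "containerDefinitions": [
        {
            "name": "my-app",
            "secrets": [
                {"name": "DB_PASSWORD",
                 "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:my-app/db-AbCdEf"},
                {"name": "API_URL",
                 "valueFrom": "arn:aws:ssm:us-east-1:111122223333:parameter/my-app/api-url"},
            ],
        },
        {"name": "sidecar"},
    ]
}

for service, values in secret_references(sample_task_definition).items():
    print(service, values)
```

Every value listed must be covered by a `Resource` pattern in the execution role policy. Secrets encrypted with a customer managed KMS key also need `kms:Decrypt` on that key.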

Execute Command Not Working

Symptom

SessionManagerPlugin is not found

or

Execute command is disabled

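Solutions

"SessionManagerPlugin is not found" comes from the local machine, not from ECS: install the Session Manager plugin for the AWS CLI where you run the command. For "Execute command is disabled", apply the two fixes below.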

1. Enable execute command on service

resource "aws_ecs_service" "app" {
  # ... other service arguments unchanged ...
  enable_execute_command = true
}

2. Add SSM permissions to task role

resource "aws_iam_role_policy" "ssm_exec" {
  role = aws_iam_role.ecs_task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}
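
After applying both changes, the task description shows whether ECS Exec is actually usable. The sketch below checks a task dict shaped like `describe_tasks` output for the `enableExecuteCommand` flag and for a running `ExecuteCommandAgent` in each container's `managedAgents` list. The helper name and sample data are hypothetical.

```python
def exec_readiness(task: dict) -> list:
    """Return a list of problems that would block `aws ecs execute-command`."""
    problems = []
    if not task.get("enableExecuteCommand", False):
        problems.append("enableExecuteCommand is false for this task")
    for container in task.get("containers", []):
        agents = {
            agent.get("name"): agent.get("lastStatus")
            for agent in container.get("managedAgents", [])
        }
        status = agents.get("ExecuteCommandAgent")
        if status != "RUNNING":
            problems.append(
                f"{container.get('name', '?')}: ExecuteCommandAgent status is {status}"
            )
    return problems


# Hypothetical sample data for illustration only.
ready_task = {
    "enableExecuteCommand": True,
    "containers": [
        {"name": "my-app",
         "managedAgents": [{"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}]},
    ],
}
broken_task = {
    "enableExecuteCommand": False,
    "containers": [{"name": "my-app", "managedAgents": []}],
}

print(exec_readiness(ready_task))
print(exec_readiness(broken_task))
```

The flag is set per task at launch, so tasks started before the service was updated keep the old value until they are replaced.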

Performance Issues

High CPU/Memory Usage

Diagnostic

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Print CPU and memory utilization for the last hour"""

    end = datetime.now(timezone.utc)

    for metric in ('CPUUtilization', 'MemoryUtilization'):
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/ECS',
            MetricName=metric,
            Dimensions=[
                {'Name': 'ClusterName', 'Value': cluster},
                {'Name': 'ServiceName', 'Value': service}
            ],
            StartTime=end - timedelta(hours=1),
            EndTime=end,
            Period=300,
            Statistics=['Average', 'Maximum']
        )

        print(metric)
        # Datapoints are returned unordered, so sort by time before printing
        for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
            print(f"  {point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")
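
A single spike rarely justifies resizing; sustained load does. The sketch below takes a list shaped like the `Datapoints` field of `get_metric_statistics` and returns the longest run of consecutive periods above a threshold. The helper name, the 80% threshold and the sample values are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def longest_run_above(datapoints: list, threshold: float,
                      statistic: str = "Average") -> int:
    """Return the longest run of consecutive datapoints above the threshold."""
    longest = current = 0
    # CloudWatch returns datapoints unordered, so sort by time first.
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        if point[statistic] > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical sample data: five 5-minute periods, deliberately out of order.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
values = [50.0, 85.0, 90.0, 60.0, 95.0]
sample_points = [
    {"Timestamp": start + timedelta(minutes=5 * i), "Average": value}
    for i, value in enumerate(values)
][::-1]

print(longest_run_above(sample_points, threshold=80.0))
```

With a 300-second period, a run of 2 means roughly ten minutes above the threshold.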

Solutions

1. Right-size tasks

# Increase resources
cpu    = "1024"  # from 512
memory = "2048"  # from 1024

2. Enable auto-scaling

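resource "aws_appautoscaling_policy" "cpu" {
  # name, resource_id, scalable_dimension and service_namespace omitted for brevity
  policy_type = "TargetTrackingScaling"
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}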

Slow Task Startup

Causes & Solutions

1. Large container image

  • Use smaller base images (alpine, distroless)
  • Fargate does not cache images between task launches; on platform version 1.4.0, consider lazy loading with Seekable OCI (SOCI) indexes

2. Slow application startup

  • Increase startPeriod in health check
  • Optimize application initialization

3. Slow secret/config loading

  • Use VPC endpoints for faster access
  • Cache configuration at startup

Log Analysis

CloudWatch Logs Queries

# Find errors in last hour
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --start-time $(date -d '-1 hour' +%s000) \
  --filter-pattern "ERROR"

# Find OOM kills
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "OutOfMemory"

# Find slow requests
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "[timestamp, level, duration>1000, ...]"

CloudWatch Insights

# Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10

# Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)
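
These queries can also be run from Python. The sketch below uses the boto3 CloudWatch Logs calls `start_query` and `get_query_results`; the wrapper names, the one-hour default window and the polling interval are assumptions for illustration. Result rows arrive as lists of `{field, value}` cells, so `rows_to_dicts` flattens them first.

```python
import time


def rows_to_dicts(results: list) -> list:
    """Convert Logs Insights result rows into plain {field: value} dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_insights_query(log_group: str, query: str, hours: int = 1,
                       poll_seconds: float = 1.0) -> list:
    """Run a Logs Insights query and wait for it (requires AWS credentials)."""
    import boto3  # imported lazily so rows_to_dicts works offline

    logs = boto3.client("logs")
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - hours * 3600,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Poll until the query leaves the Scheduled/Running states.
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])


# Hypothetical sample rows in the shape returned by get_query_results.
sample_results = [
    [{"field": "@message", "value": "ERROR db timeout"},
     {"field": "errorCount", "value": "12"}],
    [{"field": "@message", "value": "ERROR bad token"},
     {"field": "errorCount", "value": "3"}],
]

for row in rows_to_dicts(sample_results):
    print(row)
```

All values come back as strings, so convert counts and durations before doing arithmetic on them.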

Related Skills

  • boto3-ecs: SDK patterns
  • terraform-ecs: Infrastructure as Code
  • ecs-fargate: Fargate specifics
  • ecs-deployment: Deployment strategies

Quick Reference

Symptom            First Check      Common Cause
Task STOPPED       stoppedReason    Container crash, OOM
Task PENDING       Attachments      ENI/network issues
Deployment stuck   Health checks    ALB health check failing
Cannot pull image  Execution role   Missing ECR permissions
Cannot connect     Security groups  Wrong SG rules
