AWS FIS Experiment Execute
Verify that infrastructure is already deployed, run an AWS FIS experiment, monitor its progress, and generate a results report. Reads configuration from a prepared experiment directory whose CloudFormation stack has already been deployed.
Output Language Rule
Detect the language of the user's conversation and use the same language for all output.
- Chinese input -> Chinese output
- English input -> English output
Prerequisites
Required tools:
- AWS CLI — aws fis, aws cloudwatch, aws cloudformation, aws logs
- kubectl — configured with access to the target EKS cluster
- A prepared experiment directory (from aws-fis-experiment-prepare skill)
- The CloudFormation stack for this experiment must already be deployed
REQUIRED SUB-SKILL: app-service-log-analysis must be installed. Loaded at
runtime for application discovery, log collection, and analysis. Without it, the
experiment can still run but log analysis will be skipped.
Workflow
digraph execute_flow {
"User input:\npath or template ID?" [shape=diamond];
"Search CWD for\nmatching directory" [shape=box];
"Directory found?" [shape=diamond];
"Ask user for full path" [shape=box, style=bold];
"Validate files" [shape=box];
"Read README for stack name" [shape=box];
"Check CFN stack status" [shape=diamond];
"Extract template ID from outputs" [shape=box];
"Display actionIds" [shape=box];
"Pre-experiment health check" [shape=box, color=blue];
"All resources healthy?" [shape=diamond];
"Wait / prompt user" [shape=box];
"Discover apps + start logs\n(app-service-log-analysis)" [shape=box];
"User confirms experiment start" [shape=diamond, style=bold, color=red];
"Start experiment" [shape=box];
"Monitor experiment\n+ log insights" [shape=box];
"Experiment complete?" [shape=diamond];
"Wait 3 min post-baseline" [shape=box];
"Stop logs + analyze\n(app-service-log-analysis)" [shape=box];
"Generate results report" [shape=box];
"User input:\npath or template ID?" -> "Validate files" [label="Full path"];
"User input:\npath or template ID?" -> "Search CWD for\nmatching directory" [label="Template ID"];
"Search CWD for\nmatching directory" -> "Directory found?";
"Directory found?" -> "Validate files" [label="Yes (1 match)"];
"Directory found?" -> "Ask user for full path" [label="No match"];
"Ask user for full path" -> "Validate files" [label="User provides path"];
"Validate files" -> "Read README for stack name";
"Read README for stack name" -> "Check CFN stack status";
"Check CFN stack status" -> "Extract template ID from outputs" [label="CREATE_COMPLETE"];
"Check CFN stack status" -> "Generate results report" [label="Not deployed / failed, abort"];
"Extract template ID from outputs" -> "Display actionIds";
"Display actionIds" -> "Pre-experiment health check";
"Pre-experiment health check" -> "All resources healthy?";
"All resources healthy?" -> "Discover apps + start logs\n(app-service-log-analysis)" [label="Yes"];
"All resources healthy?" -> "Wait / prompt user" [label="No"];
"Wait / prompt user" -> "Pre-experiment health check" [label="Retry (poll 60s,\nmax 10 min non-interactive)"];
"Wait / prompt user" -> "Discover apps + start logs\n(app-service-log-analysis)" [label="User override"];
"Wait / prompt user" -> "Generate results report" [label="Abort"];
"Discover apps + start logs\n(app-service-log-analysis)" -> "User confirms experiment start";
"User confirms experiment start" -> "Start experiment" [label="Yes, I confirm"];
"User confirms experiment start" -> "Stop logs + analyze\n(app-service-log-analysis)" [label="No, abort"];
"Start experiment" -> "Monitor experiment\n+ log insights";
"Monitor experiment\n+ log insights" -> "Experiment complete?";
"Experiment complete?" -> "Monitor experiment\n+ log insights" [label="No, poll again"];
"Experiment complete?" -> "Wait 3 min post-baseline" [label="Yes"];
"Wait 3 min post-baseline" -> "Stop logs + analyze\n(app-service-log-analysis)";
"Stop logs + analyze\n(app-service-log-analysis)" -> "Generate results report";
}
Step 1: Resolve and Validate Experiment Directory
The user provides either:
- (a) the full path to the experiment directory, OR
- (b) an FIS experiment template ID (e.g., EXT1a2b3c4d5e6f7)
Step 1a: Resolve directory from template ID
If the user provides a template ID, search CWD for directories ending with that ID:
find . -maxdepth 1 -type d -name "*${TEMPLATE_ID_INPUT}" 2>/dev/null
- 1 match → use it, inform user
- Multiple matches → list and ask user to choose
- No match → ask user for full path. Do NOT proceed without a valid path.
Step 1b: Extract template ID from directory name
The experiment directory name ends with the template ID (e.g., 2026-04-11-az-power-int-my-cluster-EXT1a2b3c4d5e6f7). Extract it:
TEMPLATE_ID=$(basename "${EXPERIMENT_DIR}" | grep -oE 'EXT[a-zA-Z0-9]+$')
Store as TEMPLATE_ID. This is used in all subsequent steps.
Step 1c: Validate required files
Verify EXPERIMENT_DIR contains: cfn-template.yaml, README.md.
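A minimal validation sketch:

```bash
# Fail fast if either required file is missing from the experiment directory.
for f in cfn-template.yaml README.md; do
  if [ ! -f "${EXPERIMENT_DIR}/${f}" ]; then
    echo "Missing required file: ${EXPERIMENT_DIR}/${f}" >&2
    exit 1
  fi
done
```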
Step 2: Read README and Extract Experiment Metadata
Read README.md from the experiment directory to extract:
- Scenario name — from the H1 heading (e.g., # FIS Experiment: AZ Power Interruption)
- Target region — from **Region:** {REGION}
- Target AZ — from **Target AZ:** {AZ_ID} (if applicable)
- Estimated duration — from **Estimated Duration:** {DURATION}
- Affected resources — from the "Affected Resources" table
- CFN Stack Name — from **CFN Stack:** {STACK_NAME} (for cleanup reference only)
Present a summary to the user with all extracted information.
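The workflow's "Check CFN stack status" node sits between this step and Step 3. A minimal sketch of that check, using the STACK_NAME and REGION values read from the README above (the ExperimentTemplateId output key matches the error table at the end of this document):

```bash
# The stack must be fully deployed before the experiment may start.
STACK_STATUS=$(aws cloudformation describe-stacks \
  --stack-name "${STACK_NAME}" --region "${REGION}" \
  --query 'Stacks[0].StackStatus' --output text)

if [ "${STACK_STATUS}" != "CREATE_COMPLETE" ]; then
  echo "Stack ${STACK_NAME} is in ${STACK_STATUS}, not CREATE_COMPLETE; aborting." >&2
  exit 1
fi

# Cross-check the template ID extracted in Step 1b against the stack outputs.
aws cloudformation describe-stacks \
  --stack-name "${STACK_NAME}" --region "${REGION}" \
  --query "Stacks[0].Outputs[?OutputKey=='ExperimentTemplateId'].OutputValue" \
  --output text
```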
Step 3: Display Experiment Actions
Use TEMPLATE_ID (extracted from directory name in Step 1b) to query the experiment
template via AWS CLI and display all action IDs:
aws fis get-experiment-template \
--id "{TEMPLATE_ID}" \
--region {REGION} \
--query 'experimentTemplate.actions' --output json
Extract all actionId values from the actions map and display them to the user:
Actions found:
- {actionId_1}
- {actionId_2}
...
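If only the action IDs are needed, a JMESPath keys() query trims the output (a sketch):

```bash
aws fis get-experiment-template \
  --id "${TEMPLATE_ID}" \
  --region "${REGION}" \
  --query 'keys(experimentTemplate.actions)' --output text
```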
Proceed directly to Step 3.5 (resource health check).
Step 3.5: Pre-Experiment Resource Health Check
Before starting log collection or the experiment itself, verify that every target resource referenced by the FIS experiment template is in a healthy baseline state. Starting an experiment against already-degraded resources makes results unattributable and may amplify impact on fragile infrastructure.
Scope: All resources listed in the FIS experiment template's targets map
(from the Step 3 query). This covers any managed service — RDS, Aurora, MSK,
ElastiCache (Redis/Memcached), EKS clusters and nodegroups, EC2, OpenSearch,
DocumentDB, etc. — whatever the template targets.
Procedure:
- For each target in the experiment template, extract its resourceType (e.g. aws:rds:db, aws:msk:cluster, aws:elasticache:replicationgroup) and the actual resource identifiers (from resourceArns, or resolved from resourceTags).
- For each resource, call the appropriate AWS describe API for its service and read the canonical status field. Use your knowledge of AWS services to pick the right API and the right "healthy" value (e.g. RDS available, MSK ACTIVE, ElastiCache available, EKS ACTIVE, EC2 running with status check ok). A probe sketch follows this list.
- Present a table to the user:

| Resource | Type | Status | Healthy? |
|---|---|---|---|
| {id_1} | {resourceType_1} | {status_1} | {✓ or ✗} |
| {id_2} | {resourceType_2} | {status_2} | {✓ or ✗} |
| ... | ... | ... | ... |

- If the resource type is unfamiliar or the API call fails, mark the resource as unchecked and treat it as unhealthy for decision purposes.
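For example, probes for a few common target types might look like this (a sketch; DB_ID, MSK_ARN, and CLUSTER_NAME are placeholder identifiers resolved from the template's targets):

```bash
# RDS instance: healthy when DBInstanceStatus is "available".
aws rds describe-db-instances \
  --db-instance-identifier "${DB_ID}" --region "${REGION}" \
  --query 'DBInstances[0].DBInstanceStatus' --output text

# MSK cluster: healthy when State is "ACTIVE".
aws kafka describe-cluster-v2 \
  --cluster-arn "${MSK_ARN}" --region "${REGION}" \
  --query 'ClusterInfo.State' --output text

# EKS cluster: healthy when status is "ACTIVE".
aws eks describe-cluster \
  --name "${CLUSTER_NAME}" --region "${REGION}" \
  --query 'cluster.status' --output text
```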
Decision rules:
- All resources healthy → proceed to Step 4.
- One or more resources unhealthy / unchecked:
  - Interactive session (you can prompt the user and get a response): warn the user, list the problem resources with their current states, and wait for explicit input. Accept: proceed (override and continue), abort (stop the workflow), or retry (re-run the health check now).
  - Non-interactive session (no user available to respond): automatically poll every 60 seconds for up to 10 minutes. Recheck every target resource each cycle. If all resources become healthy within the window, continue to Step 4 automatically. If the 10-minute window expires with any resource still unhealthy, abort and output a diagnostic summary listing each unhealthy resource, its current state, and the duration of the wait.
How to determine interactive vs non-interactive: Use your own judgment based on the runtime context (e.g. whether a TTY is attached, whether you can invoke interactive prompts, or environment signals suggesting a CI/automated run). When uncertain, default to interactive behavior.
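For the non-interactive path, the retry loop is simple (a sketch; check_all_targets_healthy is a hypothetical function wrapping the describe calls shown above):

```bash
# Poll every 60 seconds for up to 10 minutes, then abort if still unhealthy.
healthy=false
for attempt in $(seq 1 10); do
  if check_all_targets_healthy; then
    healthy=true
    break
  fi
  echo "Attempt ${attempt}/10: targets still unhealthy; waiting 60s..."
  sleep 60
done

if [ "${healthy}" != "true" ]; then
  echo "Targets still unhealthy after the 10-minute window; aborting." >&2
  exit 1
fi
```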
Step 4: Discover EKS Applications and Start Log Collection
REQUIRED: You MUST use the skill tool to load the app-service-log-analysis
skill NOW, before proceeding. Call: skill(name="app-service-log-analysis").
This injects the skill's instructions into your context so you can execute its
steps. If the skill is not installed or cannot be loaded, inform the user and
skip log collection (the experiment can still run without it).
This step runs BEFORE the experiment starts — discovering applications after the experiment begins risks missing early log entries that get rotated or overwritten.
Execute from app-service-log-analysis skill:
- Its Step 2 (Read Service List) — extracts affected AWS services from the experiment directory's expected-behavior.md
- Its Step 2.5 (Detect and Collect Managed Service Logs) — checks managed service CloudWatch logging status, records log groups for later analysis
- Its "Multi-Cluster EKS Discovery and Kubeconfig Isolation" section — discovers all EKS clusters in the target region, generates an isolated kubeconfig per cluster (never overwrites ~/.kube/config), and verifies access to each cluster (see the sketch after this list)
- Its Step 3 (Collect Application Dependencies — Deep Scan) — resolves service endpoints, deep-scans all accessible clusters in parallel, confirms discovered dependencies with the user
- Its Step 4 (Log Collection — Real-time Mode) — starts background kubectl logs -f for all confirmed applications across all clusters
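The kubeconfig isolation and background collection pattern looks roughly like this (a sketch only; the authoritative procedure lives in the app-service-log-analysis skill, and NAMESPACE/DEPLOYMENT are placeholders). Note that LOG_DIR is exported in a multi-line script rather than a && chain; the error table below records the failure mode that motivates this:

```bash
# One isolated kubeconfig per cluster; ~/.kube/config is never touched.
export LOG_DIR="${EXPERIMENT_DIR}/logs"
mkdir -p "${LOG_DIR}"
KUBECONFIG_FILE="${LOG_DIR}/kubeconfig-${CLUSTER_NAME}"
aws eks update-kubeconfig --name "${CLUSTER_NAME}" \
  --region "${REGION}" --kubeconfig "${KUBECONFIG_FILE}"

# Background log follower; the PID is recorded so Step 7 can stop it cleanly.
kubectl --kubeconfig "${KUBECONFIG_FILE}" -n "${NAMESPACE}" \
  logs -f "deploy/${DEPLOYMENT}" \
  > "${LOG_DIR}/${CLUSTER_NAME}-${NAMESPACE}-${DEPLOYMENT}.log" 2>&1 &
echo $! >> "${LOG_DIR}/.pids"
```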
Step 5: Start Experiment (CRITICAL CONFIRMATION)
This is the most dangerous step. The experiment WILL affect real resources.
Before starting, present a clear warning:
WARNING: Starting this FIS experiment will cause REAL impact:
Scenario: {SCENARIO_NAME}
Region: {REGION}
Target AZ: {AZ_ID}
Duration: {DURATION}
Stack: {STACK_NAME} (verified: CREATE_COMPLETE)
Template ID: {TEMPLATE_ID}
Resources that WILL be affected:
- {list each affected resource type and count from README}
Stop Conditions:
- {list each alarm that will stop the experiment}
Applications being monitored:
- {list each namespace/deployment from SERVICE_APP_MAP}
Managed service log collection:
- {list each service with logging status from MANAGED_LOG_GROUPS}
Log directory: {LOG_DIR}
Post-experiment baseline: 3 minutes (automatic)
Type "Yes, start experiment" to proceed, or "No" to abort.
Only proceed if the user explicitly confirms. If the user aborts, proceed to Step 7 to stop log collection and clean up first.
Save the returned experiment.id.
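A minimal start sketch that captures the ID:

```bash
# start-experiment returns the experiment object; keep the id for monitoring.
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "${TEMPLATE_ID}" \
  --region "${REGION}" \
  --query 'experiment.id' --output text)
echo "Experiment started: ${EXPERIMENT_ID}"
```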
Step 6: Monitor Experiment
Poll the experiment status and display progress. See references/cli-commands.md for
polling commands and experiment status reference.
Polling strategy (a loop sketch follows this list):
- Poll every 30 seconds for the first 5 minutes
- Poll every 60 seconds after that
- Show current status after each poll
- Record timestamps for each status change and action state transition — these feed into the per-service timeline in the final report
- Track per-service events: For each service affected by the experiment, note when it was impacted (action started), when it recovered, and any intermediate states. Query service-specific status (e.g., RDS instance status, ElastiCache replication group status, EKS node status) during monitoring to capture detailed observations.
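A loop sketch of the cadence above (the authoritative polling commands are in references/cli-commands.md; the per-service queries and the sub-skill's Step 5 log-insight display run inside each cycle):

```bash
START_TS=$(date +%s)
while true; do
  STATUS=$(aws fis get-experiment --id "${EXPERIMENT_ID}" \
    --region "${REGION}" --query 'experiment.state.status' --output text)
  echo "$(date -u +%FT%TZ) experiment status: ${STATUS}"

  # Per-service checks and the log-insight summary would run here each cycle.

  case "${STATUS}" in
    completed|stopped|failed) break ;;
  esac

  # 30s cadence for the first 5 minutes, 60s afterwards.
  if [ $(( $(date +%s) - START_TS )) -lt 300 ]; then sleep 30; else sleep 60; fi
done
```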
Log insights during each poll cycle: Execute app-service-log-analysis Step 5
(Real-time Monitoring Display) — read recent logs, count errors/warnings, display
per-app summary, detect recovery signals. The skill must already be loaded from Step 4.
During monitoring, remind the user:
- Check the CloudWatch dashboard for real-time metrics
- The experiment can be stopped at any time (see references/cli-commands.md for the stop command)
Step 7: Post-Experiment Baseline, Stop Log Collection and Analyze
After the experiment completes (any terminal state):
Post-Experiment Baseline (3 minutes)
Continue collecting logs for 3 minutes after the experiment ends to capture recovery behavior. This applies to both application logs and managed service logs. Display a countdown to the user:
Experiment completed. Collecting post-experiment baseline logs...
Remaining: {countdown} (3 minutes total)
After the 3-minute baseline window ends, proceed to analysis.
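A simple countdown sketch for that window:

```bash
# Logs keep streaming in the background while the baseline window runs.
for remaining in $(seq 180 -30 30); do
  echo "Collecting post-experiment baseline logs... remaining: ${remaining}s (3 minutes total)"
  sleep 30
done
```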
Generate Application Log Analysis
Execute app-service-log-analysis Steps 7-8:
- Its Step 7 (Generate Analysis Report) — analyze error patterns, peak rates, recovery times, and generate the "Application Log Analysis" section of the report. The analysis time window extends 3 minutes past the experiment end time to cover the baseline period.
- Its Step 8 (Cleanup) — kill background kubectl logs processes (see the sketch below)
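The cleanup amounts to killing the PIDs recorded in Step 4 (a sketch matching the .pids bookkeeping shown there):

```bash
# Stop every background kubectl logs follower started in Step 4.
if [ -f "${LOG_DIR}/.pids" ]; then
  xargs kill < "${LOG_DIR}/.pids" 2>/dev/null || true
  rm -f "${LOG_DIR}/.pids"
fi
```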
The application log analysis output is embedded into the experiment results report (see Step 8 below), NOT saved as a separate file.
Step 8: Save Results Report to Local File
After the experiment completes (any terminal state), generate a results report and write it directly to a local markdown file in the experiment directory.
See references/report-template.md for the complete report structure, file naming
convention, and timestamp format rules.
Per-service analysis: Identify all services affected by the experiment from the README's "Affected Resources" table. For each service, create a sub-section with: (1) timeline events, (2) observed behavior, (3) key findings. Include indirectly affected services.
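For illustration only (the filename here is hypothetical; the authoritative naming convention and structure are in references/report-template.md), the write step might look like:

```bash
# Illustrative filename; follow references/report-template.md in practice.
REPORT_FILE="${EXPERIMENT_DIR}/experiment-results-$(date -u +%Y%m%dT%H%M%SZ).md"
{
  echo "# FIS Experiment Results: ${SCENARIO_NAME}"
  echo "- Experiment ID: ${EXPERIMENT_ID}"
  echo "- Final status: ${STATUS}"
} > "${REPORT_FILE}"
echo "Report written to ${REPORT_FILE}"
```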
After saving, print a brief terminal summary:
- File path, experiment ID, final status
- Start/end time and duration (ISO 8601 with timezone)
- Per-action status (one line each)
- Per-service recovery status (one line each)
- Application log summary — total errors per app
- Issues requiring attention (if any)
- Cleanup instructions
Safety Rules
- Never auto-start experiments. Always require explicit user confirmation.
- Show every CLI command before executing it.
- Display impact warning before experiment start with specific resource list.
- Provide abort instructions at every step.
- Never delete resources without user confirmation.
- Never deploy infrastructure. This skill only checks existing deployments.
- Recommend dry-run first — suggest the user review all files before starting.
- Never start an experiment against unhealthy resources. Step 3.5 verifies every target resource is in its service's healthy baseline state. In interactive mode, any unhealthy or unchecked resource requires explicit user override. In non-interactive mode, poll every 60 seconds for up to 10 minutes; abort if still unhealthy when the window expires.
Cleanup Guide
After the experiment, offer cleanup. See references/cli-commands.md for commands.
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| Stack name not found in README | README missing **CFN Stack:** field | Check if the experiment was prepared with a recent version of aws-fis-experiment-prepare |
| Stack not found (ValidationError) | Stack does not exist or was deleted | Deploy the stack first using aws-fis-experiment-prepare |
| Stack in CREATE_FAILED / ROLLBACK_COMPLETE | Stack deployment failed | Check stack events for failure reason, fix and redeploy |
| ExperimentTemplateId not in outputs | Stack template missing output | Check cfn-template.yaml for the output definition |
| AccessDeniedException | Insufficient permissions | Check IAM permissions for FIS, CloudWatch, CloudFormation |
| ResourceNotFoundException on targets | Tagged resources not found | Verify resource tags match experiment template |
| Experiment stuck in initiating | IAM role propagation delay | Wait 30 seconds and check again |
| kubectl: command not found | kubectl not installed | Install kubectl and configure kubeconfig |
| error: You must be logged in | kubeconfig not configured | Run aws eks update-kubeconfig --name {cluster} |
| /.pids: Permission denied | LOG_DIR variable empty due to && chain | Use multi-line script with export LOG_DIR=..., NOT && chains |
| No EKS apps discovered | No pods reference affected service endpoints | Ask user to manually specify namespace/deployment pairs |