# Q-Exploratory-Analysis
Universal exploratory data analysis (EDA) for tabular datasets. Previews the dataset, interviews the user to confirm column measurement levels, and applies statistically appropriate analysis for each variable type, producing a structured TABLE/ folder and a holistic EXPLORATORY_SUMMARY.md with flagged insights.
**IMPORTANT:** This skill requires Bash execution. The pre-built `scripts/run_eda.py` must be copied from `${SKILL_DIR}/scripts/` — do NOT write a new script or inline Python. Run it (Phases 0-6) to produce the CSVs and the Excel report, then write `EXPLORATORY_SUMMARY.md` yourself (Post-Script step) by reading the CSVs, consulting `${SKILL_DIR}/references/summary_template.md`, and using the Write tool directly.

**Plan Mode Guard:** If in plan mode, write a brief plan — "Run q-exploratory-analysis skill: interview user for context and column types, deploy and execute run_eda.py, write EXPLORATORY_SUMMARY.md from generated CSVs." — then call ExitPlanMode immediately. Do NOT attempt Stages A/B, script deployment, or any analysis while plan mode is active.
## 0a. Dependencies

```
pandas numpy scipy openpyxl   # openpyxl is required for .xlsx input and the Phase 6 Excel report
```

Install: `pip install pandas numpy scipy openpyxl`
## 0b. Script Deployment

Agent execution instructions:

- Determine this SKILL.md file's directory path as `SKILL_DIR`.
- Script path = `${SKILL_DIR}/scripts/run_eda.py`.
- Reference path = `${SKILL_DIR}/references/summary_template.md`.

The skill ships a pre-built `scripts/run_eda.py`. Do NOT write a new script.

Deploy the script to the project before first run:

```bash
mkdir -p scripts && cp "${SKILL_DIR}/scripts/run_eda.py" scripts/run_eda.py
```

Windows equivalent:

```bat
mkdir scripts 2>nul & copy "%SKILL_DIR%\scripts\run_eda.py" scripts\run_eda.py
```

Or in PowerShell:

```powershell
New-Item -ItemType Directory -Force scripts; Copy-Item "$env:SKILL_DIR/scripts/run_eda.py" scripts/run_eda.py
```

If the project already has `scripts/run_eda.py`, verify it matches the skill version before running (the skill version is authoritative).
## 1. Requirements Gathering (Interview)

Before invoking the script, conduct a two-stage interview:

### Stage A: Context Questions

Ask 2 questions before loading data:

1. Research questions - What are you exploring? (guides which comparisons matter)
2. Temporal column - Is there a date/time column for trend analysis?

### Stage B: Column Classification Review

After the context questions, auto-detect column types and present them for confirmation:

1. **Load and preview** - Run `run_eda.py --preview` to get `df.head()`, `df.dtypes`, `df.nunique()`. (Requires Bash — if in plan mode, see the Plan Mode Guard above.)
2. **Auto-classify** - Apply the heuristic rules from Section 3 to generate suggested types. Pay special attention to:
   - High-cardinality numeric integers (views, revenue, duration) — these should be Continuous, not ID
   - Low-cardinality integers (1-5, 1-7 scales) — ask whether Ordinal (scale) or Discrete (count)
   - Columns with >40% missing — flag for user awareness
3. **Present classification table** - Show all columns organized by suggested type:

   | # | Column | Sample Values | Unique | Suggested Type | Notes |
   |---|--------|---------------|--------|----------------|-------|
   | 1 | id | 1, 2, 3 | 768 | ID | nunique > 95% of n, non-numeric |
   | 2 | views | 100, 5000, 1M | 1885 | Continuous | high-cardinality integer metric |
   | 3 | platform | YouTube, Twitch | 5 | Nominal (Group) | grouping variable |
   | 4 | rating | 1, 2, 3, 4, 5 | 5 | Ordinal | low-cardinality integer - scale or count? |
   | ... | ... | ... | ... | ... | ... |
4. **Ask for confirmation** - Use AskUserQuestion to present the table and ask: "If these all look correct, select Confirm all. Otherwise, select Corrections needed."
5. **Record confirmed types** and map them to script arguments.
6. **Invoke the script immediately** - Do NOT write inline Python. Map confirmed types to `--col_types` pairs and run `run_eda.py` per Section 2.
## 2. Script Invocation

Standard invocation (built from the interview):

```bash
python scripts/run_eda.py data.xlsx \
  --col_types rating=ordinal views=continuous description=text record_id=id \
  --group platform tier \
  --output TABLE/
```

Windows note: Use `python` (not `python3`). If the system has both Python 2 and 3, use `py -3` or the full path (e.g., `C:\Python312\python.exe`).
Arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| `data` | Yes | Path to `.xlsx` or `.csv` |
| `--col_types` | No | Confirmed types as `col=type` pairs. Valid: `id`, `binary`, `nominal`, `ordinal`, `discrete`, `continuous`, `temporal`, `text`. Unspecified columns are auto-detected. |
| `--group` | No | Nominal columns for grouped analysis |
| `--output` | No | Output directory (default: `TABLE/`) |
| `--top_n` | No | Top-N for frequency tables (default: 10) |
| `--no_excel` | No | Skip Phase 6 (Excel report) |
| `--corr_deletion` | No | Missing-data strategy for correlations: `pairwise` (default) or `listwise` |
| `--interactive` | No | Prompt for ambiguous integers (standalone CLI only) |
Behavioral defaults:

- `--group` omitted: grouping defaults to all nominal columns (no cap).
- Cross-tabs: all nominal column pairs are generated (no cap).
- ID detection: requires n > 10 rows; otherwise high-uniqueness columns stay as their detected numeric/string type.
- `LOW_CARD_MAX = 20`: integers with <= 20 unique values are flagged as ambiguous (ordinal vs. discrete).
- `--corr_deletion` omitted: uses pairwise deletion (drops missing values per pair, maximizing N). Use `--corr_deletion listwise` for a consistent N across all pairs; see the sketch below.
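
For intuition, a minimal pandas sketch of the difference between the two deletion strategies (illustrative only; the shipped run_eda.py is authoritative):

```python
import pandas as pd

df = pd.DataFrame({
    "views":  [100, 5000, None, 250, 900],
    "likes":  [10, 480, 30, None, 85],
    "shares": [2, 120, 8, 3, None],
})

# Pairwise deletion (the pandas default): each pair's correlation uses every
# row where BOTH columns of that pair are present, so N varies per pair.
pairwise = df.corr(method="pearson")

# Listwise deletion: drop any row with a missing value first, so every pair
# is computed over the same, smaller N.
listwise = df.dropna().corr(method="pearson")

print(pairwise, listwise, sep="\n\n")
```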
## 3. Column-Type Coverage

Claude presents suggested types during the interview; the user confirms or corrects them before the script runs. Types follow Stevens' levels of measurement, extended with Temporal, Text, and ID/key types.

| Level | Detection Rule | Analysis Applied |
|-------|----------------|------------------|
| ID/key | nunique > 95% of n AND non-numeric dtype | Excluded from analysis; flagged in profile |
| Binary | Exactly 2 unique values (any dtype) | Count, proportion, 95% CI |
| Nominal | object/string dtype, avg length <= 50 chars | Frequency table, mode, top-N |
| Ordinal | User-confirmed low-cardinality integer | Ordered freq table + cumulative % + full quantitative metrics (M labeled quasi-interval) |
| Discrete | integer dtype, meaningful count/quantity | Full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts) |
| Continuous | float dtype, OR numeric with nunique > 95% of n (n > 10) | Full quantitative metrics + outlier flags + distribution shape |
| Temporal | datetime dtype or parseable date string | Monthly aggregated trend table (mean/median per period); gap detection not yet implemented |
| Text | object dtype, avg length > 50 chars | Top unigrams/bigrams, avg word count, vocabulary richness |
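
A minimal sketch of how these detection rules could be expressed in code, assuming the thresholds above (95% uniqueness, 50-char average length, `LOW_CARD_MAX = 20`); the shipped run_eda.py is the authoritative implementation:

```python
import pandas as pd

LOW_CARD_MAX = 20  # integers with <= 20 unique values are ambiguous

def suggest_type(s: pd.Series) -> str:
    """Heuristic level-of-measurement suggestion for one column (sketch)."""
    n, nunique = len(s), s.nunique(dropna=True)
    if nunique == 2:
        return "binary"                   # exactly 2 unique values, any dtype
    if pd.api.types.is_datetime64_any_dtype(s):
        return "temporal"
    if pd.api.types.is_float_dtype(s):
        return "continuous"
    if pd.api.types.is_integer_dtype(s):
        if n > 10 and nunique > 0.95 * n:
            return "continuous"           # high-cardinality metric, NOT an ID
        if nunique <= LOW_CARD_MAX:
            return "ordinal-or-discrete"  # ambiguous: ask the user
        return "discrete"
    # object/string columns
    if n > 10 and nunique > 0.95 * n:
        return "id"                       # high-uniqueness, non-numeric key
    avg_len = s.dropna().astype(str).str.len().mean()
    return "text" if avg_len > 50 else "nominal"
```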
## 4. Pipeline

### Phase 0: Data Loading & Column Classification

Loads the file, applies user-confirmed column types (passed via `--col_types`), auto-detects unambiguous types for the remaining columns, and prints a schema summary: column name | detected type | nunique | missing%.

### Phase 1: Dataset Profile

`01_dataset_profile.csv` - shape, column types, missing%, uniqueness.

### Phase 2: Data Quality Assessment

`02_data_quality.csv` - missing counts/%, duplicate rows, constant columns, and outlier counts per Continuous/Discrete column (IQR method; see the sketch below).
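
For reference, a minimal sketch of the IQR outlier rule (mild fence at 1.5 x IQR, extreme at 3.0 x IQR, matching the mild_outliers_IQR1.5 / extreme_outliers_IQR3.0 columns in the detail tables); assume run_eda.py does something equivalent:

```python
import pandas as pd

def iqr_outlier_counts(s: pd.Series) -> dict:
    """Count mild (1.5*IQR) and extreme (3.0*IQR) outliers in one column."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mild = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
    extreme = ((s < q1 - 3.0 * iqr) | (s > q3 + 3.0 * iqr)).sum()
    return {"mild_outliers_IQR1.5": int(mild),
            "extreme_outliers_IQR3.0": int(extreme)}
```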
### Phase 3: Univariate Analysis

Measurement-level-appropriate analysis for each column:

- `03_nominal_frequencies.csv` - frequency + % for each Nominal column (top-N)
- `04_binary_summary.csv` - count, proportion, 95% CI for all Binary columns
- `05_ordinal_distribution.csv` - ordered frequency + cumulative %
- `06_ordinal_descriptives.csv` - full quantitative metrics (M quasi-interval, Mdn, SD, IQR, SE, 95% CI, skewness, kurtosis, outliers)
- `07_discrete_descriptives.csv` - full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
- `08_continuous_descriptives.csv` - full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
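
A rough illustration of this metric set, assuming a t-based CI for means and a Wald CI for binary proportions (the script's exact choices are not specified here, and the key names below are assumptions mirroring the detail-table columns):

```python
import numpy as np
import pandas as pd
from scipy import stats

def describe_quantitative(s: pd.Series) -> dict:
    """Full quantitative metric set for one Discrete/Continuous column (sketch)."""
    s = s.dropna()
    n = len(s)
    m, sd = s.mean(), s.std(ddof=1)
    se = sd / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)   # t-based 95% CI for the mean
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "N_valid": n, "M": m, "Mdn": s.median(), "Mode": s.mode().iloc[0],
        "SD": sd, "Var": s.var(ddof=1), "Range": s.max() - s.min(),
        "IQR": q3 - q1, "CV_pct": 100 * sd / m if m else np.nan,
        "Q1_25th": q1, "Q3_75th": q3,
        "skewness": stats.skew(s, bias=False),
        "kurtosis": stats.kurtosis(s, bias=False),   # excess kurtosis
        "SE": se, "CI95_lower": m - t_crit * se, "CI95_upper": m + t_crit * se,
    }

def proportion_ci(k: int, n: int) -> tuple[float, float]:
    """Wald 95% CI for a binary proportion (one common choice)."""
    p = k / n
    half = 1.96 * np.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```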
### Phase 4: Bivariate & Multivariate Analysis

Measurement-appropriate pairing analysis:

- `09_pearson_correlation.csv` - Ratio-scale x Ratio-scale (Continuous + Discrete; r + p-value; pairwise deletion by default)
- `10_spearman_correlation.csv` - Ordinal x Ordinal (rho + p-value; pairwise deletion by default)
- `11_grouped_by_{groupvar}.csv` - Continuous/Discrete/Ordinal descriptives per Nominal group
- `12_crosstab_{nom1}x{nom2}.csv` - Nominal x Nominal contingency tables
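
A minimal sketch of the level-appropriate pairings (all column names are hypothetical; the shipped script is authoritative):

```python
import pandas as pd
from scipy import stats

df = pd.read_excel("data.xlsx")  # hypothetical input file

# Ratio-scale x ratio-scale: Pearson r with pairwise deletion.
pair = df[["views", "likes"]].dropna()
r, p = stats.pearsonr(pair["views"], pair["likes"])

# Ordinal x ordinal: Spearman rho (rank-based, ties handled).
pair = df[["rating", "tier_rank"]].dropna()
rho, p_s = stats.spearmanr(pair["rating"], pair["tier_rank"])

# Nominal x nominal: contingency table.
ct = pd.crosstab(df["platform"], df["tier"])

# Nominal group x continuous: grouped descriptives.
grouped = df.groupby("platform")["views"].agg(["count", "mean", "median", "std"])
```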
### Phase 5: Specialized Analysis

- `13_text_{colname}.csv` - word freq, top bigrams, avg word count, vocab size
- `14_temporal_trends.csv` - key metrics over time by period
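
Sketches of both specialized passes, assuming a pandas implementation (the period grain, tokenizer, and column names here are illustrative assumptions):

```python
import re
from collections import Counter
import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["published_at"])  # hypothetical columns

# Temporal: monthly aggregated trend table (mean/median per period).
trends = (df.set_index("published_at")
            .resample("MS")["views"]
            .agg(["mean", "median", "count"]))

# Text: top unigrams/bigrams, average word count, vocabulary size.
tokens_per_doc = [re.findall(r"[a-z']+", str(t).lower()) for t in df["description"]]
unigrams = Counter(tok for doc in tokens_per_doc for tok in doc)
bigrams = Counter(tuple(doc[i:i + 2]) for doc in tokens_per_doc
                  for i in range(len(doc) - 1))
avg_word_count = sum(map(len, tokens_per_doc)) / len(tokens_per_doc)
vocab_size = len(unigrams)
```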
### Phase 6: Excel Report

`EXPLORATORY_REPORT.xlsx` - APA-7th formatted workbook (B&W, no color fills). Sheet 0 "Summary" contains dataset dimensions and quality highlights. Sheets 1-N contain one sheet per generated CSV (skipped if absent). All sheets use Calibri 11 pt, bold table-number rows, italic title rows, bold column headers with a bottom border, and data rows with top/bottom rules only. Skip with `--no_excel`.
### Post-Script: Narrative Summary (Model-Authored)

IMPORTANT: This step is performed by Claude after the script finishes. `EXPLORATORY_SUMMARY.md` is produced by Claude reading the generated CSVs and writing a formatted document with the Write tool.

Before writing, read `${SKILL_DIR}/references/summary_template.md`. This template shows the exact expected structure, table formats, and narrative style for each section. Use it as the structural blueprint — populate it with actual data from the generated CSVs. The template includes worked example rows for every table type.

Step 1 — Read the CSVs using the Read tool: `01_dataset_profile.csv`, `02_data_quality.csv`, `03_nominal_frequencies.csv`, `04_binary_summary.csv`, `05_ordinal_distribution.csv`, `06_ordinal_descriptives.csv`, `07_discrete_descriptives.csv`, `08_continuous_descriptives.csv`, `09_pearson_correlation.csv`, `10_spearman_correlation.csv`, all `11_grouped_by_*.csv`, all `12_crosstab_*.csv`, all `13_text_*.csv`, and `14_temporal_trends.csv`.

Step 2 — Write `TABLE/EXPLORATORY_SUMMARY.md` directly using the Write tool.
Content requirements: one section per measurement level plus infrastructure sections. Required structure:

- Dataset Overview (Source: `01_dataset_profile.csv`, `14_temporal_trends.csv` — temporal range and period count only)
- Data Quality (Source: `02_data_quality.csv`)
- Nominal Variables (Source: `03_nominal_frequencies.csv`)
- Binary Variables (Source: `04_binary_summary.csv`)
- Ordinal Variables (Source: `05_ordinal_distribution.csv`, `06_ordinal_descriptives.csv`)
- Discrete Variables (Source: `07_discrete_descriptives.csv`)
- Continuous Variables (Source: `08_continuous_descriptives.csv`)
- Bivariate Relationships (Source: `09_pearson_correlation.csv`, `10_spearman_correlation.csv`)
- Group Comparisons (Source: `11_grouped_by_*.csv`)
- Cross-Tabulations (Source: `12_crosstab_*.csv`)
- Text Analysis (Source: `13_text_*.csv`; omit section if no text columns)
- Temporal Trends (Source: `14_temporal_trends.csv`; omit section if no temporal column — full trend table and pattern interpretation go here)
- Output Files — bullet list of all files in the output directory
Table formatting rules:

- Descriptive tables (ordinal `06`, discrete `07`, continuous `08`): split into two tables:
  - Core table: variable, N_valid, missing_count, missing_pct, M (or M_quasi_interval), Mdn, Mode, SD, IQR, Skewness, Kurtosis
  - Detail table: variable, Variance, Range, Min, Max, Q1_25th, Q3_75th, CV_pct, SE, CI95_lower, CI95_upper, mild_outliers_IQR1.5, extreme_outliers_IQR3.0, P10_10th, P90_90th (discrete/continuous only)
- Narrow CSVs (frequencies `03`, binary `04`, ordinal distribution `05`, correlations `09`/`10`, crosstabs `12`): use the same columns as the source CSV.
- Frequency tables: include all rows from the CSV (the script already caps at top-N).
- Correlation tables: include all pairs; bold rows where abs(r/rho) > 0.7.
- Cross-tabs: include the full table from the CSV. If a crosstab has >10 columns, note the CSV as the authoritative source and provide a prose summary instead.
- Do not round further than the CSV already rounds.
General rules:

- End each section with 1–3 sentences of interpretation.
- Every content section heading must include a Source: annotation citing the CSV file(s) it draws from. Infrastructure sections (e.g., Output Files) are exempt.
- NEVER derive findings from ad-hoc Python. All findings must come from the generated CSVs.
- Section omission: if all source CSVs for a section are absent, omit the section entirely. If some source CSVs exist but others are absent, include the section and note "No [subtype] data available (file skipped by script)" for the missing part.
## 5. Output Directory Reference

```
TABLE/
├── 01_dataset_profile.csv
├── 02_data_quality.csv
├── 03_nominal_frequencies.csv
├── 04_binary_summary.csv
├── 05_ordinal_distribution.csv
├── 06_ordinal_descriptives.csv
├── 07_discrete_descriptives.csv
├── 08_continuous_descriptives.csv
├── 09_pearson_correlation.csv
├── 10_spearman_correlation.csv
├── 11_grouped_by_{groupvar}.csv     (one per group variable)
├── 12_crosstab_{nom1}x{nom2}.csv    (one per Nominal pair)
├── 13_text_{colname}.csv            (one per text column)
├── 14_temporal_trends.csv
├── EXPLORATORY_REPORT.xlsx          (omitted with --no_excel)
└── EXPLORATORY_SUMMARY.md
```

Files are omitted when no columns of the relevant type exist (e.g., no text columns means no `13_text_*.csv`).
## 6. Interpreting the Summary

`EXPLORATORY_SUMMARY.md` uses a flagged-insight format rather than just reporting numbers:

- Issue flags - "Column X has 34% missing - review before analysis"
- Pattern highlights - "Strong positive correlation between views and likes (r=0.87, p<.001)"
- Distribution notes - "Duration is right-skewed (skewness=3.1) - median better represents typical values"
- Coverage summary - "768 records across 5 platforms; YouTube accounts for 62% of sample"
- Measurement caveats - Ordinal means labeled as quasi-interval; Spearman used instead of Pearson for ranked scales
Flagging thresholds:

| Flag | Threshold | Severity |
|------|-----------|----------|
| Missing data | > 20% of rows | review |
| Skewness | abs(skew) > 2 | distribution |
| Kurtosis | abs(kurt) > 7 | distribution |
| Correlation | abs(r/rho) > 0.7 | relationship |
| CV | > 100% | variability |
| ID-like column | nunique > 95% of n | exclusion |
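
Purely to pin down the cutoffs, a sketch of the flag logic (key names are assumptions; per the rules above, the summary's flags must be read off the generated CSVs rather than recomputed with ad-hoc code):

```python
def flags_for(row: dict) -> list[str]:
    """Map one descriptives row (e.g., a row of 08_continuous_descriptives.csv)
    to insight flags using the thresholds above. Key names are assumptions."""
    flags = []
    if row["missing_pct"] > 20:
        flags.append("review: >20% missing")
    if abs(row["skewness"]) > 2:
        flags.append("distribution: |skew| > 2")
    if abs(row["kurtosis"]) > 7:
        flags.append("distribution: |kurtosis| > 7")
    if row["CV_pct"] > 100:
        flags.append("variability: CV > 100%")
    return flags
```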
## 7. Design Principles

- Exploratory-first - No confirmatory statistics; builds the picture before hypothesis testing
- User-confirmed classification - Claude suggests measurement levels from heuristic rules; the user reviews and confirms before analysis runs
- Measurement-appropriate - Uses median/IQR for Ordinal, Pearson for Continuous, Spearman for Ordinal, cross-tabulation for Nominal
- Insight-flagging - Summary reports patterns and warnings, not just numbers
- Dual output - CSVs for validation and import into results tables; markdown for interpretation
- APA-compatible statistics - Full metric set (M, Mdn, SD, SE, 95% CI, skewness, kurtosis) ready for APA 7th reporting
## 8. Verification Checklist

- Column classification table presented to user; user confirmed or corrected types
- Confirmed types passed via `--col_types` and grouping columns via `--group`
- Each detected column type has at least one output file
- `EXPLORATORY_SUMMARY.md` written by Claude via the Write tool
- All 13 content sections present where applicable; sections for absent column types omitted per the two-tier rule
- Descriptive tables (ordinal/discrete/continuous) use the core + detail split-table format
- Every content section contains at least one populated markdown table and one narrative sentence (infrastructure sections, e.g., Output Files, are exempt)
- Numbers in the summary match the source CSVs exactly
- `EXPLORATORY_REPORT.xlsx` present with "Summary" as the first sheet + one sheet per CSV (unless `--no_excel`)
- Excel workbook is B&W only (no color fills; no gridlines)
- Bivariate outputs use the correct method for each level pairing
- High-cardinality numeric columns classified as Continuous, not ID (check `01_dataset_profile.csv`)
- Pearson correlations include both continuous and discrete ratio-scale columns
- Every `EXPLORATORY_SUMMARY.md` section cites its source CSV file(s) in the heading
- No ad-hoc Python used to derive findings outside the script pipeline
- Script invoked with `python` (not `python3`) on Windows
## Future Roadmap

| Enhancement | Rationale |
|-------------|-----------|
| Visualization suite | Box plots, histograms, heatmaps, violin plots per measurement level |
| Normality tests (Shapiro-Wilk for n < 5000) | Formal distribution assessment for Continuous variables |
| Hypothesis generation section | Bridge EDA to confirmatory research questions |
| ydata-profiling HTML report | One-line automated profiling complement |
| Interval vs. Ratio distinction | Full Stevens' framework for specific research contexts |