# Q-Exploratory-Analysis
Universal exploratory data analysis (EDA) for tabular datasets. Previews the dataset, interviews the user to confirm column measurement levels, and applies statistically appropriate analysis for each variable type, producing a structured TABLE/ folder and a holistic EXPLORATORY_SUMMARY.md with flagged insights.
**IMPORTANT:** This skill requires Bash execution. The pre-built `scripts/run_eda.py` must be copied from `${SKILL_DIR}/scripts/` — do NOT write a new script or inline Python. Run it (Phases 0-6) to produce the CSVs and the Excel report, then write `EXPLORATORY_SUMMARY.md` yourself (Post-Script step) by reading the CSVs, consulting `${SKILL_DIR}/references/summary_template.md`, and using the Write tool directly.

**Plan Mode Guard:** If in plan mode, write a brief plan — "Run q-exploratory-analysis skill: interview user for context and column types, deploy and execute run_eda.py, write EXPLORATORY_SUMMARY.md from generated CSVs." — then call ExitPlanMode immediately. Do NOT attempt Stages A/B, script deployment, or any analysis while plan mode is active.
## 0a. Dependencies

```
pandas numpy scipy openpyxl   # openpyxl is required for .xlsx input and the Phase 6 Excel report
```

Install: `pip install pandas numpy scipy openpyxl`
## 0b. Script Deployment

Agent execution instructions:

- Determine this SKILL.md file's directory path as `SKILL_DIR`.
- Script path = `${SKILL_DIR}/scripts/run_eda.py`.
- Reference path = `${SKILL_DIR}/references/summary_template.md`.

The skill ships a pre-built `scripts/run_eda.py`. Do NOT write a new script.

Deploy the script to the project before first run:

```bash
mkdir -p scripts && cp "${SKILL_DIR}/scripts/run_eda.py" scripts/run_eda.py
```

Windows equivalent:

```bat
mkdir scripts 2>nul & copy "%SKILL_DIR%\scripts\run_eda.py" scripts\run_eda.py
```

Or in PowerShell:

```powershell
New-Item -ItemType Directory -Force scripts; Copy-Item "$env:SKILL_DIR/scripts/run_eda.py" scripts/run_eda.py
```

If the project already has `scripts/run_eda.py`, verify it matches the skill version before running (the skill version is authoritative).
## 1. Requirements Gathering (Interview)

Before invoking the script, conduct a two-stage interview:

### Stage A: Context Questions

Ask 2 questions before loading data:

1. Research questions - What are you exploring? (guides which comparisons matter)
2. Temporal column - Is there a date/time column for trend analysis?

### Stage B: Column Classification Review

After the context questions, auto-detect column types and present them for confirmation:

1. **Load and preview** - Run `run_eda.py --preview` to get `df.head()`, `df.dtypes`, `df.nunique()`. (Requires Bash — if in plan mode, see the Plan Mode Guard above.)
2. **Auto-classify** - Apply the heuristic rules from Section 3 to generate suggested types. Pay special attention to:
   - High-cardinality numeric integers (views, revenue, duration) — these should be Continuous, not ID
   - Low-cardinality integers (1-5, 1-7 scales) — ask whether Ordinal (scale) or Discrete (count)
   - Columns with >40% missing — flag for user awareness
3. **Present classification table** - Show all columns organized by suggested type:

   | # | Column | Sample Values | Unique | Suggested Type | Notes |
   |---|--------|---------------|--------|----------------|-------|
   | 1 | id | 1, 2, 3 | 768 | ID | nunique > 95% of n, non-numeric |
   | 2 | views | 100, 5000, 1M | 1885 | Continuous | high-cardinality integer metric |
   | 3 | platform | YouTube, Twitch | 5 | Nominal (Group) | grouping variable |
   | 4 | rating | 1, 2, 3, 4, 5 | 5 | Ordinal | low-cardinality integer - scale or count? |
   | ... | ... | ... | ... | ... | ... |
4. **Ask for confirmation** - Use AskUserQuestion to present the table and ask: "If these all look correct, select Confirm all. Otherwise, select Corrections needed."
5. **Record confirmed types** and map them to script arguments.
6. **Invoke the script immediately** - Do NOT write inline Python. Map confirmed types to `--col_types` pairs and run `run_eda.py` per Section 2.
## 2. Script Invocation

Standard invocation (built from the interview):

```bash
python scripts/run_eda.py data.xlsx \
  --col_types rating=ordinal views=continuous description=text record_id=id \
  --group platform tier \
  --output TABLE/
```

Windows note: Use `python` (not `python3`). If the system has both Python 2 and 3, use `py -3` or the full path (e.g., `C:\Python312\python.exe`).
Arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| `data` | Yes | Path to `.xlsx` or `.csv` |
| `--col_types` | No | Confirmed types as `col=type` pairs. Valid: `id`, `binary`, `nominal`, `ordinal`, `discrete`, `continuous`, `temporal`, `text`. Unspecified columns are auto-detected. |
| `--group` | No | Nominal columns for grouped analysis |
| `--output` | No | Output directory (default: `TABLE/`) |
| `--top_n` | No | Top-N for frequency tables (default: 10) |
| `--no_excel` | No | Skip Phase 6 (Excel report) |
| `--corr_deletion` | No | Missing-data strategy for correlations: `pairwise` (default) or `listwise` |
| `--interactive` | No | Prompt for ambiguous integers (standalone CLI only) |
Behavioral defaults:

- `--group` omitted: grouping defaults to all nominal columns (no cap).
- Cross-tabs: all nominal column pairs are generated (no cap).
- ID detection: requires n > 10 rows; otherwise high-uniqueness columns stay as their detected numeric/string type.
- `LOW_CARD_MAX = 20`: integers with <= 20 unique values are flagged as ambiguous (ordinal vs. discrete).
- `--corr_deletion` omitted: uses pairwise deletion (drops missing values per pair, maximizing N). Use `--corr_deletion listwise` for a consistent N across all pairs; see the sketch below.
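
For intuition, a minimal pandas sketch of the difference between the two deletion strategies (illustrative only; the shipped run_eda.py is authoritative):

```python
import pandas as pd

df = pd.DataFrame({
    "views":  [100, 5000, None, 250, 900],
    "likes":  [10, 480, 30, None, 85],
    "shares": [2, 120, 8, 3, None],
})

# Pairwise deletion (the pandas default): each pair's correlation uses every
# row where BOTH columns of that pair are present, so N varies per pair.
pairwise = df.corr(method="pearson")

# Listwise deletion: drop any row with a missing value first, so every pair
# is computed over the same, smaller N.
listwise = df.dropna().corr(method="pearson")

print(pairwise, listwise, sep="\n\n")
```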
## 3. Column-Type Coverage

Claude presents suggested types during the interview; the user confirms or corrects them before the script runs. Types follow Stevens' levels of measurement, extended with Temporal, Text, and ID/key types.

| Level | Detection Rule | Analysis Applied |
|-------|----------------|------------------|
| ID/key | nunique > 95% of n AND non-numeric dtype | Excluded from analysis; flagged in profile |
| Binary | Exactly 2 unique values (any dtype) | Count, proportion, 95% CI |
| Nominal | object/string dtype, avg length <= 50 chars | Frequency table, mode, top-N |
| Ordinal | User-confirmed low-cardinality integer | Ordered freq table + cumulative % + full quantitative metrics (M labeled quasi-interval) |
| Discrete | integer dtype, meaningful count/quantity | Full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts) |
| Continuous | float dtype, OR numeric with nunique > 95% of n (n > 10) | Full quantitative metrics + outlier flags + distribution shape |
| Temporal | datetime dtype or parseable date string | Monthly aggregated trend table (mean/median per period); gap detection not yet implemented |
| Text | object dtype, avg length > 50 chars | Top unigrams/bigrams, avg word count, vocabulary richness |
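
A minimal sketch of how these detection rules could be expressed in code, assuming the thresholds above (95% uniqueness, 50-char average length, `LOW_CARD_MAX = 20`); the shipped run_eda.py is the authoritative implementation:

```python
import pandas as pd

LOW_CARD_MAX = 20  # integers with <= 20 unique values are ambiguous

def suggest_type(s: pd.Series) -> str:
    """Heuristic level-of-measurement suggestion for one column (sketch)."""
    n, nunique = len(s), s.nunique(dropna=True)
    if nunique == 2:
        return "binary"                   # exactly 2 unique values, any dtype
    if pd.api.types.is_datetime64_any_dtype(s):
        return "temporal"
    if pd.api.types.is_float_dtype(s):
        return "continuous"
    if pd.api.types.is_integer_dtype(s):
        if n > 10 and nunique > 0.95 * n:
            return "continuous"           # high-cardinality metric, NOT an ID
        if nunique <= LOW_CARD_MAX:
            return "ordinal-or-discrete"  # ambiguous: ask the user
        return "discrete"
    # object/string columns
    if n > 10 and nunique > 0.95 * n:
        return "id"                       # high-uniqueness, non-numeric key
    avg_len = s.dropna().astype(str).str.len().mean()
    return "text" if avg_len > 50 else "nominal"
```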
## 4. Pipeline

### Phase 0: Data Loading & Column Classification

Loads the file, applies user-confirmed column types (passed via `--col_types`), auto-detects unambiguous types for the remaining columns, and prints a schema summary: column name | detected type | nunique | missing%.

### Phase 1: Dataset Profile

`01_dataset_profile.csv` - shape, column types, missing%, uniqueness.

### Phase 2: Data Quality Assessment

`02_data_quality.csv` - missing counts/%, duplicate rows, constant columns, and outlier counts per Continuous/Discrete column (IQR method; see the sketch below).
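
For reference, a minimal sketch of the IQR outlier rule (mild fence at 1.5 x IQR, extreme at 3.0 x IQR, matching the mild_outliers_IQR1.5 / extreme_outliers_IQR3.0 columns in the detail tables); assume run_eda.py does something equivalent:

```python
import pandas as pd

def iqr_outlier_counts(s: pd.Series) -> dict:
    """Count mild (1.5*IQR) and extreme (3.0*IQR) outliers in one column."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mild = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
    extreme = ((s < q1 - 3.0 * iqr) | (s > q3 + 3.0 * iqr)).sum()
    return {"mild_outliers_IQR1.5": int(mild),
            "extreme_outliers_IQR3.0": int(extreme)}
```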
### Phase 3: Univariate Analysis

Measurement-level-appropriate analysis for each column:

- `03_nominal_frequencies.csv` - frequency + % for each Nominal column (top-N)
- `04_binary_summary.csv` - count, proportion, 95% CI for all Binary columns
- `05_ordinal_distribution.csv` - ordered frequency + cumulative %
- `06_ordinal_descriptives.csv` - full quantitative metrics (M quasi-interval, Mdn, SD, IQR, SE, 95% CI, skewness, kurtosis, outliers)
- `07_discrete_descriptives.csv` - full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
- `08_continuous_descriptives.csv` - full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
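
A rough illustration of this metric set, assuming a t-based CI for means and a Wald CI for binary proportions (the script's exact choices are not specified here, and the key names below are assumptions mirroring the detail-table columns):

```python
import numpy as np
import pandas as pd
from scipy import stats

def describe_quantitative(s: pd.Series) -> dict:
    """Full quantitative metric set for one Discrete/Continuous column (sketch)."""
    s = s.dropna()
    n = len(s)
    m, sd = s.mean(), s.std(ddof=1)
    se = sd / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)   # t-based 95% CI for the mean
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "N_valid": n, "M": m, "Mdn": s.median(), "Mode": s.mode().iloc[0],
        "SD": sd, "Var": s.var(ddof=1), "Range": s.max() - s.min(),
        "IQR": q3 - q1, "CV_pct": 100 * sd / m if m else np.nan,
        "Q1_25th": q1, "Q3_75th": q3,
        "skewness": stats.skew(s, bias=False),
        "kurtosis": stats.kurtosis(s, bias=False),   # excess kurtosis
        "SE": se, "CI95_lower": m - t_crit * se, "CI95_upper": m + t_crit * se,
    }

def proportion_ci(k: int, n: int) -> tuple[float, float]:
    """Wald 95% CI for a binary proportion (one common choice)."""
    p = k / n
    half = 1.96 * np.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```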
### Phase 4: Bivariate & Multivariate Analysis

Measurement-appropriate pairing analysis:

- `09_pearson_correlation.csv` - Ratio-scale x Ratio-scale (Continuous + Discrete; r + p-value; pairwise deletion by default)
- `10_spearman_correlation.csv` - Ordinal x Ordinal (rho + p-value; pairwise deletion by default)
- `11_grouped_by_{groupvar}.csv` - Continuous/Discrete/Ordinal descriptives per Nominal group
- `12_crosstab_{nom1}x{nom2}.csv` - Nominal x Nominal contingency tables
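
A minimal sketch of the level-appropriate pairings (all column names are hypothetical; the shipped script is authoritative):

```python
import pandas as pd
from scipy import stats

df = pd.read_excel("data.xlsx")  # hypothetical input file

# Ratio-scale x ratio-scale: Pearson r with pairwise deletion.
pair = df[["views", "likes"]].dropna()
r, p = stats.pearsonr(pair["views"], pair["likes"])

# Ordinal x ordinal: Spearman rho (rank-based, ties handled).
pair = df[["rating", "tier_rank"]].dropna()
rho, p_s = stats.spearmanr(pair["rating"], pair["tier_rank"])

# Nominal x nominal: contingency table.
ct = pd.crosstab(df["platform"], df["tier"])

# Nominal group x continuous: grouped descriptives.
grouped = df.groupby("platform")["views"].agg(["count", "mean", "median", "std"])
```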
### Phase 5: Specialized Analysis

- `13_text_{colname}.csv` - word freq, top bigrams, avg word count, vocab size
- `14_temporal_trends.csv` - key metrics over time by period
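
Sketches of both specialized passes, assuming a pandas implementation (the period grain, tokenizer, and column names here are illustrative assumptions):

```python
import re
from collections import Counter
import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["published_at"])  # hypothetical columns

# Temporal: monthly aggregated trend table (mean/median per period).
trends = (df.set_index("published_at")
            .resample("MS")["views"]
            .agg(["mean", "median", "count"]))

# Text: top unigrams/bigrams, average word count, vocabulary size.
tokens_per_doc = [re.findall(r"[a-z']+", str(t).lower()) for t in df["description"]]
unigrams = Counter(tok for doc in tokens_per_doc for tok in doc)
bigrams = Counter(tuple(doc[i:i + 2]) for doc in tokens_per_doc
                  for i in range(len(doc) - 1))
avg_word_count = sum(map(len, tokens_per_doc)) / len(tokens_per_doc)
vocab_size = len(unigrams)
```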
### Phase 6: Excel Report

`EXPLORATORY_REPORT.xlsx` - APA-7th formatted workbook (B&W, no color fills). Sheet 0 "Summary" contains dataset dimensions and quality highlights. Sheets 1-N contain one sheet per generated CSV (skipped if absent). All sheets use Calibri 11 pt, bold table-number rows, italic title rows, bold column headers with a bottom border, and data rows with top/bottom rules only. Skip with `--no_excel`.
### Post-Script: Narrative Summary (Model-Authored)

IMPORTANT: This step is performed by Claude after the script finishes. `EXPLORATORY_SUMMARY.md` is produced by Claude reading the generated CSVs and writing a formatted document with the Write tool.

Before writing, read `${SKILL_DIR}/references/summary_template.md`. This template shows the exact expected structure, table formats, and narrative style for each section. Use it as the structural blueprint — populate it with actual data from the generated CSVs. The template includes worked example rows for every table type.

Step 1 — Read the CSVs using the Read tool: `01_dataset_profile.csv`, `02_data_quality.csv`, `03_nominal_frequencies.csv`, `04_binary_summary.csv`, `05_ordinal_distribution.csv`, `06_ordinal_descriptives.csv`, `07_discrete_descriptives.csv`, `08_continuous_descriptives.csv`, `09_pearson_correlation.csv`, `10_spearman_correlation.csv`, all `11_grouped_by_*.csv`, all `12_crosstab_*.csv`, all `13_text_*.csv`, and `14_temporal_trends.csv`.

Step 2 — Write `TABLE/EXPLORATORY_SUMMARY.md` directly using the Write tool.
Content requirements: one section per measurement level plus infrastructure sections. Required structure:

- Dataset Overview (Source: `01_dataset_profile.csv`, `14_temporal_trends.csv` — temporal range and period count only)
- Data Quality (Source: `02_data_quality.csv`)
- Nominal Variables (Source: `03_nominal_frequencies.csv`)
- Binary Variables (Source: `04_binary_summary.csv`)
- Ordinal Variables (Source: `05_ordinal_distribution.csv`, `06_ordinal_descriptives.csv`)
- Discrete Variables (Source: `07_discrete_descriptives.csv`)
- Continuous Variables (Source: `08_continuous_descriptives.csv`)
- Bivariate Relationships (Source: `09_pearson_correlation.csv`, `10_spearman_correlation.csv`)
- Group Comparisons (Source: `11_grouped_by_*.csv`)
- Cross-Tabulations (Source: `12_crosstab_*.csv`)
- Text Analysis (Source: `13_text_*.csv`; omit section if no text columns)
- Temporal Trends (Source: `14_temporal_trends.csv`; omit section if no temporal column — full trend table and pattern interpretation go here)
- Output Files — bullet list of all files in the output directory
Table formatting rules:

- Descriptive tables (ordinal `06`, discrete `07`, continuous `08`): split into two tables:
  - Core table: variable, N_valid, missing_count, missing_pct, M (or M_quasi_interval), Mdn, Mode, SD, IQR, Skewness, Kurtosis
  - Detail table: variable, Variance, Range, Min, Max, Q1_25th, Q3_75th, CV_pct, SE, CI95_lower, CI95_upper, mild_outliers_IQR1.5, extreme_outliers_IQR3.0, P10_10th, P90_90th (discrete/continuous only)
- Narrow CSVs (frequencies `03`, binary `04`, ordinal distribution `05`, correlations `09`/`10`, crosstabs `12`): use the same columns as the source CSV.
- Frequency tables: include all rows from the CSV (the script already caps at top-N).
- Correlation tables: include all pairs; bold rows where abs(r/rho) > 0.7.
- Cross-tabs: include the full table from the CSV. If a crosstab has >10 columns, note the CSV as the authoritative source and provide a prose summary instead.
- Do not round further than the CSV already rounds.
General rules:

- End each section with 1–3 sentences of interpretation.
- Every content section heading must include a Source: annotation citing the CSV file(s) it draws from. Infrastructure sections (e.g., Output Files) are exempt.
- NEVER derive findings from ad-hoc Python. All findings must come from the generated CSVs.
- Section omission: if all source CSVs for a section are absent, omit the section entirely. If some source CSVs exist but others are absent, include the section and note "No [subtype] data available (file skipped by script)" for the missing part.
## 5. Output Directory Reference

```
TABLE/
├── 01_dataset_profile.csv
├── 02_data_quality.csv
├── 03_nominal_frequencies.csv
├── 04_binary_summary.csv
├── 05_ordinal_distribution.csv
├── 06_ordinal_descriptives.csv
├── 07_discrete_descriptives.csv
├── 08_continuous_descriptives.csv
├── 09_pearson_correlation.csv
├── 10_spearman_correlation.csv
├── 11_grouped_by_{groupvar}.csv     (one per group variable)
├── 12_crosstab_{nom1}x{nom2}.csv    (one per Nominal pair)
├── 13_text_{colname}.csv            (one per text column)
├── 14_temporal_trends.csv
├── EXPLORATORY_REPORT.xlsx          (omitted with --no_excel)
└── EXPLORATORY_SUMMARY.md
```

Files are omitted when no columns of the relevant type exist (e.g., no text columns means no `13_text_*.csv`).
## 6. Interpreting the Summary

`EXPLORATORY_SUMMARY.md` uses a flagged-insight format rather than just reporting numbers:

- Issue flags - "Column X has 34% missing - review before analysis"
- Pattern highlights - "Strong positive correlation between views and likes (r=0.87, p<.001)"
- Distribution notes - "Duration is right-skewed (skewness=3.1) - median better represents typical values"
- Coverage summary - "768 records across 5 platforms; YouTube accounts for 62% of sample"
- Measurement caveats - Ordinal means labeled as quasi-interval; Spearman used instead of Pearson for ranked scales
Flagging thresholds:

| Flag | Threshold | Severity |
|------|-----------|----------|
| Missing data | > 20% of rows | review |
| Skewness | abs(skew) > 2 | distribution |
| Kurtosis | abs(kurt) > 7 | distribution |
| Correlation | abs(r/rho) > 0.7 | relationship |
| CV | > 100% | variability |
| ID-like column | nunique > 95% of n | exclusion |
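
Purely to pin down the cutoffs, a sketch of the flag logic (key names are assumptions; per the rules above, the summary's flags must be read off the generated CSVs rather than recomputed with ad-hoc code):

```python
def flags_for(row: dict) -> list[str]:
    """Map one descriptives row (e.g., a row of 08_continuous_descriptives.csv)
    to insight flags using the thresholds above. Key names are assumptions."""
    flags = []
    if row["missing_pct"] > 20:
        flags.append("review: >20% missing")
    if abs(row["skewness"]) > 2:
        flags.append("distribution: |skew| > 2")
    if abs(row["kurtosis"]) > 7:
        flags.append("distribution: |kurtosis| > 7")
    if row["CV_pct"] > 100:
        flags.append("variability: CV > 100%")
    return flags
```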
## 7. Design Principles

- Exploratory-first - No confirmatory statistics; builds the picture before hypothesis testing
- User-confirmed classification - Claude suggests measurement levels from heuristic rules; the user reviews and confirms before analysis runs
- Measurement-appropriate - Uses median/IQR for Ordinal, Pearson for Continuous, Spearman for Ordinal, cross-tabulation for Nominal
- Insight-flagging - Summary reports patterns and warnings, not just numbers
- Dual output - CSVs for validation and import into results tables; markdown for interpretation
- APA-compatible statistics - Full metric set (M, Mdn, SD, SE, 95% CI, skewness, kurtosis) ready for APA 7th reporting
## 8. Verification Checklist

- Column classification table presented to user; user confirmed or corrected types
- Confirmed types passed via `--col_types` and grouping columns via `--group`
- Each detected column type has at least one output file
- `EXPLORATORY_SUMMARY.md` written by Claude via the Write tool
- All 13 content sections present where applicable; sections for absent column types omitted per the two-tier rule
- Descriptive tables (ordinal/discrete/continuous) use the core + detail split-table format
- Every content section contains at least one populated markdown table and one narrative sentence (infrastructure sections, e.g., Output Files, are exempt)
- Numbers in the summary match the source CSVs exactly
- `EXPLORATORY_REPORT.xlsx` present with "Summary" as the first sheet + one sheet per CSV (unless `--no_excel`)
- Excel workbook is B&W only (no color fills; no gridlines)
- Bivariate outputs use the correct method for each level pairing
- High-cardinality numeric columns classified as Continuous, not ID (check `01_dataset_profile.csv`)
- Pearson correlations include both continuous and discrete ratio-scale columns
- Every `EXPLORATORY_SUMMARY.md` section cites its source CSV file(s) in the heading
- No ad-hoc Python used to derive findings outside the script pipeline
- Script invoked with `python` (not `python3`) on Windows
## Future Roadmap

| Enhancement | Rationale |
|-------------|-----------|
| Visualization suite | Box plots, histograms, heatmaps, violin plots per measurement level |
| Normality tests (Shapiro-Wilk for n < 5000) | Formal distribution assessment for Continuous variables |
| Hypothesis generation section | Bridge EDA to confirmatory research questions |
| ydata-profiling HTML report | One-line automated profiling complement |
| Interval vs. Ratio distinction | Full Stevens' framework for specific research contexts |