Descriptive Analysis Skill
Generate a comprehensive exploratory descriptive analysis of a tabular dataset, including grouped statistics, frequency tables, entity extraction, and a publication-ready markdown summary.
Workflow
- Requirements Gathering (Interview)
Before analysis, ask the user 5-10 questions covering:
  - Research objective: exploratory vs. confirmatory analysis
  - Grouping variables: categorical variables to stratify by
  - Continuous variables: metrics to calculate descriptives for
  - Text fields requiring extraction: columns with embedded entities
  - Temporal variable: date/time column and desired granularity
  - Classification schemes: any custom tier/category definitions
  - Output preferences: CSV tables, MD summary, visualizations
- Data Preparation
Create derived variables as needed:
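
    # Tier classification (customize thresholds)
    def classify_tier(value, tiers):
        """Return the first tier whose inclusive (min, max) range contains value."""
        for tier_name, (min_val, max_val) in tiers.items():
            if min_val <= value <= max_val:
                return tier_name
        # Missing values and values between tiers (e.g. 1000.5) end up here
        return 'Other'

    # Example tier structure (bounds are inclusive)
    TIERS = {
        'Small': (0, 1000),
        'Medium': (1001, 10000),
        'Large': (10001, float('inf'))
    }

    # Temporal grouping (date_col must already be a datetime column)
    df['month'] = df['date_col'].dt.to_period('M').astype(str)
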
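For continuous metrics, a vectorized alternative to the row-wise tier function is `pandas.cut`. The sketch below is illustrative only: the `views` column, the sample values, and the choice to label unmatched values `'Other'` are assumptions, and the bins are made contiguous so that non-integer values such as 1000.5 still receive a tier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only
df = pd.DataFrame({
    "views": [0, 1000, 1000.5, 10001, np.nan],
    "date_col": ["2024-01-15", "2024-01-20", "2024-02-01", "2024-02-10", "2024-03-05"],
})

# Contiguous right-closed bins: [0, 1000], (1000, 10000], (10000, inf)
bins = [0, 1000, 10000, np.inf]
labels = ["Small", "Medium", "Large"]
tier = pd.cut(df["views"], bins=bins, labels=labels, include_lowest=True)

# Values outside every bin (including missing values) are NaN; label them 'Other'
df["tier"] = tier.cat.add_categories("Other").fillna("Other").astype(str)

# Parse dates first so the .dt accessor is available, then derive the month
df["date_col"] = pd.to_datetime(df["date_col"])
df["month"] = df["date_col"].dt.to_period("M").astype(str)
```

Because the bins share edges, every non-negative value falls into exactly one tier, which keeps the tier frequencies summing to the total N.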
- Analysis Structure
Generate tables in this order:
| File | Pattern | Contents |
|---|---|---|
| 01 | `sample_overview.csv` | N, date range, unique counts |
| 02-07 | `{groupvar}_distribution.csv` | Frequency for each grouping variable |
| 08 | `continuous_overall.csv` | Mean, SD, Median, Min, Max |
| 08a-f | `continuous_by_{groupvar}.csv` | Descriptives stratified by group |
| 09-10 | `categorical_distribution.csv` | Key categorical variables |
| 11-15 | `entity_{fieldname}.csv` | Extracted entity frequencies |
| 16 | `temporal_trends.csv` | Metrics over time |
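
The first two table types could be produced along the following lines. This is a minimal sketch, not a prescribed implementation: the helper names `sample_overview` and `frequency_table`, the output column names, and the percentage column are all assumptions.

```python
import pandas as pd

def sample_overview(df, date_col, id_cols):
    """One-row overview: N, date range, and unique counts for selected columns."""
    overview = {
        "N": len(df),
        "Date_Min": df[date_col].min(),
        "Date_Max": df[date_col].max(),
    }
    for col in id_cols:
        overview[f"Unique_{col}"] = df[col].nunique()
    return pd.DataFrame([overview])

def frequency_table(df, group_var):
    """Counts and percentages for one grouping variable, most frequent first."""
    # dropna=False keeps missing values as their own row so counts sum to N
    counts = df[group_var].value_counts(dropna=False)
    table = counts.rename_axis(group_var).reset_index(name="Count")
    table["Percent"] = (table["Count"] / len(df) * 100).round(2)
    return table
```

Calling `frequency_table(df, var)` once per grouping variable yields the `{groupvar}_distribution.csv` tables in a uniform shape.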
- Descriptive Statistics Function

    import pandas as pd

    def descriptive_stats(series, name='Variable'):
        """Summary statistics for one numeric series (missing values are ignored)."""
        return {
            'Variable': name,
            'N': series.count(),
            'Mean': series.mean(),
            'SD': series.std(),
            'Min': series.min(),
            'Q1': series.quantile(0.25),
            'Median': series.median(),
            'Q3': series.quantile(0.75),
            'Max': series.max()
        }

    def grouped_descriptives(df, var, group_var, group_col_name):
        """Descriptives of `var` for each non-missing level of `group_var`."""
        results = []
        for group in df[group_var].dropna().unique():
            group_data = df[df[group_var] == group][var].dropna()
            if len(group_data) > 0:
                stats = descriptive_stats(group_data, var)
                stats[group_col_name] = group
                results.append(stats)
        return pd.DataFrame(results)

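The same grouped statistics can also be obtained from pandas' built-in `groupby(...).describe()`, which may be convenient for many groups. The following is a sketch under that assumption; the helper name `grouped_descriptives_groupby` is hypothetical, and the columns are renamed to match the labels used above.

```python
import pandas as pd

def grouped_descriptives_groupby(df, var, group_var):
    """Per-group N, Mean, SD, Min, Q1, Median, Q3, Max via groupby().describe()."""
    table = df.groupby(group_var)[var].describe()
    table = table.rename(columns={
        "count": "N", "mean": "Mean", "std": "SD", "min": "Min",
        "25%": "Q1", "50%": "Median", "75%": "Q3", "max": "Max",
    })
    table.insert(0, "Variable", var)
    table = table.reset_index()
    # Drop groups whose values are all missing, mirroring the len() > 0 check
    return table[table["N"] > 0].reset_index(drop=True)
```

Note that `describe()` reports `N` as a float; cast it to an integer before export if the CSV should show whole counts.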
- Entity Extraction
For text fields with embedded entities (timestamps, names, etc.):

    import re

    import pandas as pd

    def extract_entities(text):
        """Extract entities from bracketed text like '[00:01:23] entity_name'"""
        if pd.isna(text) or text == '':
            return []
        entities = []
        # Bracketed timestamp, optional whitespace, then the entity up to ';' or a bracket
        pattern = r'\[[\d:]+\]\s*([^;\[\]]+)'
        matches = re.findall(pattern, str(text))
        for match in matches:
            entity = match.strip().lower()
            if entity and len(entity) > 1:
                entities.append(entity)
        return entities

    def entity_frequency(df, col):
        """Count how often each extracted entity appears in column `col`."""
        all_entities = []
        for text in df[col].dropna():
            all_entities.extend(extract_entities(text))
        return pd.Series(all_entities).value_counts()

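To make the expected input and output concrete, here is a small standalone walk-through of the same extraction idea using only the standard library. The sample rows are invented, and the `;` separator between entries is an assumption based on the pattern's character class.

```python
import re
from collections import Counter

# Assumed layout: a bracketed timestamp followed by an entity, ';'-separated
PATTERN = re.compile(r"\[[\d:]+\]\s*([^;\[\]]+)")

# Hypothetical text-field values
rows = [
    "[00:01:23] Alice; [00:02:10] Bob",
    "[00:00:05] alice",
    "",
]

counts = Counter()
for text in rows:
    for match in PATTERN.findall(text):
        entity = match.strip().lower()  # normalize case and whitespace
        if len(entity) > 1:             # skip single-character noise
            counts[entity] += 1

print(counts.most_common())  # [('alice', 2), ('bob', 1)]
```

Lowercasing before counting is what merges `Alice` and `alice` into a single entity.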
- Output Directory Structure

    TABLE/
    ├── 01_sample_overview.csv
    ├── 02_groupvar1_distribution.csv
    ├── ...
    ├── 08_continuous_overall.csv
    ├── 08a_continuous_by_groupvar1.csv
    ├── ...
    └── DESCRIPTIVE_SUMMARY.md

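One way to populate this directory is to collect the tables in a dictionary keyed by their numbered file stem and write them in one pass. The helper below is a sketch; the name `save_tables` and the dictionary convention are assumptions, not part of the skill itself.

```python
from pathlib import Path

import pandas as pd

def save_tables(tables, out_dir="TABLE"):
    """Write each DataFrame to out_dir as '<name>.csv' and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, table in tables.items():
        target = out_path / f"{name}.csv"
        table.to_csv(target, index=False)
        written.append(target)
    return written
```

For example, `save_tables({"01_sample_overview": overview_df})` would create `TABLE/01_sample_overview.csv`, where `overview_df` stands for whatever overview table was built earlier.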
- MD Summary Generator
Create a comprehensive markdown summary that includes:
  - Sample Overview: dataset dimensions and date range
  - Distribution Tables: top values for each grouping variable
  - Continuous Descriptives: overall and by each grouping variable
  - Entity Summaries: unique counts and top entities
  - Temporal Trends: key metrics over time
  - Output Files Reference: links to all CSV tables
The summary should present statistics as markdown tables with consistently rounded values, for example:
| Variable | N | Mean | SD | Median |
|---|---|---|---|---|
| views | 380 | 27192.59 | 133894.14 | 657.00 |
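
A table in this format can be rendered from a DataFrame without extra dependencies. The function below is a minimal sketch; its name `to_markdown_table` and the two-decimal default are assumptions chosen to match the example row above.

```python
import pandas as pd

def to_markdown_table(df, float_format="{:.2f}"):
    """Render a DataFrame as a pipe-delimited markdown table."""
    header = "| " + " | ".join(str(c) for c in df.columns) + " |"
    divider = "|" + "|".join("---" for _ in df.columns) + "|"
    lines = [header, divider]
    for row in df.itertuples(index=False):
        # Format floats uniformly; leave integers and strings untouched
        cells = [
            float_format.format(v) if isinstance(v, float) else str(v)
            for v in row
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Formatting only float cells keeps counts such as N displayed as whole numbers while means and standard deviations share a fixed precision.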
- Key Design Principles
  - Descriptive only: no inferential statistics unless requested
  - Flexible grouping: support any number of grouping variables
  - Top-N limits: show the top 5-10 values for large category sets
  - Clean entity extraction: normalize case, deduplicate
  - Dual output: CSV for validation, MD for interpretation
  - Video/channel counts: when applicable, report both unit types
  - Milestone annotations: add context to temporal distributions
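The top-N principle can be applied while still accounting for every row. The sketch below collapses the remaining categories into a single row; that collapsing step, the helper name `top_n_with_other`, and the `'Other'` label are assumptions rather than something the principles above require.

```python
import pandas as pd

def top_n_with_other(series, n=10, other_label="Other"):
    """Counts for the n most frequent values, with the rest summed into one row."""
    counts = series.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top
```

Keeping the remainder row means the displayed frequencies still add up to the number of non-missing values, which simplifies later verification.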
- Verification Checklist
  - All CSV files generated with > 0 rows
  - No empty/null columns
  - Sum of frequencies matches total N
  - Grouped descriptives align with overall
  - Entity extraction capturing expected patterns
  - MD summary coherent and complete
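
The first three checks lend themselves to automation. The following is a sketch only: the function names, the `TABLE` default, and the `Count` column name are assumptions, and the remaining checklist items still call for manual review.

```python
from pathlib import Path

import pandas as pd

def verify_tables(table_dir="TABLE"):
    """Return a list of problems found in the CSV tables (empty list = all passed)."""
    problems = []
    for path in sorted(Path(table_dir).glob("*.csv")):
        try:
            table = pd.read_csv(path)
        except pd.errors.EmptyDataError:
            problems.append(f"{path.name}: file is empty")
            continue
        if len(table) == 0:
            problems.append(f"{path.name}: no rows")
            continue
        # A column counts as empty when every value in it is missing
        empty_cols = [c for c in table.columns if table[c].isna().all()]
        if empty_cols:
            problems.append(f"{path.name}: empty columns {empty_cols}")
    return problems

def frequencies_match(freq_table, total_n, count_col="Count"):
    """Check that the counts in a frequency table add up to the total N."""
    return int(freq_table[count_col].sum()) == int(total_n)
```

An empty list from `verify_tables()` indicates that every table has rows and no fully empty columns.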