# Exploratory Data Analysis (EDA)
Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.
## When to use this skill

- New dataset — need orientation on structure, types, distributions
- Before feature engineering — understand variable relationships
- Data quality investigation — find anomalies, missing patterns, outliers
- Model preparation — validate assumptions about data
## Core EDA workflow

1. **Profile structure**
   - Schema, types, cardinality
   - Missing value patterns
2. **Analyze distributions**
   - Numerical: histograms, boxplots, skewness
   - Categorical: frequencies, rare categories
3. **Explore relationships**
   - Correlation matrix (numerical)
   - Cross-tabulations (categorical)
   - Target-variable relationships
4. **Identify issues**
   - Outliers, duplicates, inconsistencies
   - Class imbalance (classification)
   - Temporal patterns (time series)
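The first two workflow steps can be sketched in a few lines of pandas. This is a minimal illustration on a toy frame; the column names (`price`, `category`) are made up for the example, not taken from the skill.

```python
import pandas as pd
import numpy as np

# Toy dataset with one missing value and one extreme outlier (250.0)
df = pd.DataFrame({
    "price": [9.5, 12.0, np.nan, 11.2, 250.0],
    "category": ["a", "a", "b", "b", "b"],
})

# 1. Profile structure: types, cardinality, missing-value counts
dtypes = df.dtypes
missing = df.isna().sum()
cardinality = df.nunique()

# 2. Analyze distributions: summary stats and skewness for a numeric column
summary = df["price"].describe()
skew = df["price"].skew()  # positive here, pulled right by the 250.0 outlier
```

Steps 3 and 4 build on the same frame with `df.corr()`, `pd.crosstab`, and duplicate/outlier checks.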
## Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| Automated profiling | ydata-profiling / pandas-profiling | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars + lazy | Memory-efficient |
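As a quick illustration of the `scipy.stats` row, here is a hedged sketch of the two named checks (normality, correlations) on synthetic data — the variables and effect sizes are invented for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)  # y is strongly linear in x

# Normality check: Shapiro-Wilk (null hypothesis: data are normal)
w_stat, p_norm = stats.shapiro(x)

# Relationship checks: Pearson (linear) and Spearman (monotonic, rank-based)
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
```

Spearman is the safer default when relationships may be monotonic but nonlinear, or when outliers are present.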
## Core implementation rules

- **Start with automated profiling**

```python
import polars as pl
from ydata_profiling import ProfileReport

# Polars reads the data quickly; ydata-profiling still expects pandas,
# so convert before building the report
df = pl.read_parquet("data.parquet")
profile = ProfileReport(df.to_pandas(), title="Data Profile")
profile.to_file("profile_report.html")
```
- **Focus on actionable insights**
  - Document outliers worth investigating (not all outliers are problems)
  - Flag features with high cardinality or rare categories
  - Note strong correlations that may cause multicollinearity
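The three checks above can be automated in a few lines. A minimal sketch — the thresholds (90% uniqueness, 5% rarity, |r| > 0.95) and column names are illustrative choices, not values prescribed by this skill:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "id": [f"u{i}" for i in range(n)],            # ID-like: every value unique
    "x": rng.normal(size=n),
    "segment": ["a"] * 120 + ["b"] * 76 + ["rare"] * 4,
})
df["x_copy"] = df["x"] * 1.01                     # near-duplicate feature

# Flag high-cardinality categoricals (likely IDs, useless as features)
non_numeric = df.select_dtypes(exclude="number").columns
high_card = [c for c in non_numeric if df[c].nunique() > 0.5 * len(df)]

# Flag rare categories (< 5% of rows)
freq = df["segment"].value_counts(normalize=True)
rare = freq[freq < 0.05].index.tolist()

# Flag strongly correlated numeric pairs (potential multicollinearity)
corr = df.select_dtypes("number").corr().abs()
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.95]
```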
- **Visualize for communication**
  - Distribution plots for key variables
  - Correlation heatmap
  - Missing value patterns
  - Target relationship plots
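Two of those plots — a distribution plot and a correlation heatmap — sketched with plain matplotlib on synthetic data (`seaborn.heatmap` is the usual one-liner shortcut for the second; filenames and columns here are made up for the example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=300), "y": rng.exponential(size=300)})

# Distribution plot for a key variable
fig, ax = plt.subplots()
ax.hist(df["y"], bins=30)
ax.set_title("Distribution of y")
fig.savefig("dist_y.png")
plt.close(fig)

# Correlation heatmap
corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)), corr.columns)
ax.set_yticks(range(len(corr.columns)), corr.columns)
fig.colorbar(im)
fig.savefig("corr_heatmap.png")
plt.close(fig)

saved = Path("dist_y.png").exists() and Path("corr_heatmap.png").exists()
```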
- **Validate assumptions**
  - Check for expected ranges/business rules
  - Verify temporal consistency
  - Confirm key relationships match domain knowledge
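Range and temporal-consistency checks reduce to boolean filters. A sketch on an invented orders table — the rules (positive amounts, shipped after created) stand in for whatever the real business rules are:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.9, -5.0, 42.0],
    "created": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "shipped": pd.to_datetime(["2024-01-03", "2024-01-04", "2023-12-30"]),
})

# Business rule: amounts must be positive
bad_amount = orders[orders["amount"] <= 0]

# Temporal consistency: an order cannot ship before it was created
bad_dates = orders[orders["shipped"] < orders["created"]]
```

Rows that survive these filters are the ones worth escalating as data-quality findings.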
## Common anti-patterns

- ❌ Skipping EDA and jumping straight to modeling
- ❌ Treating all outliers as errors
- ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
- ❌ Over-plotting large datasets without sampling
- ❌ Not documenting findings for the team
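On the MCAR/MAR/MNAR point: one simple diagnostic is to test whether missingness in one column depends on another observed column. A sketch with simulated MAR data (all variables invented for the example; note MNAR cannot be detected from the data alone):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
age = rng.integers(20, 70, size=n)
income = rng.normal(50_000, 10_000, size=n)
# Simulate MAR: income is missing far more often for younger respondents
mask = rng.random(n) < np.where(age < 30, 0.4, 0.05)
income = np.where(mask, np.nan, income)
df = pd.DataFrame({"age": age, "income": income})

# Compare the age distribution of rows with vs. without missing income.
# A clear difference suggests MAR (missingness depends on observed data)
# rather than MCAR.
is_missing = df["income"].isna()
t_stat, p_value = stats.ttest_ind(df.loc[is_missing, "age"],
                                  df.loc[~is_missing, "age"])
```

A small p-value here argues against dropping rows blindly, since deletion would bias the sample toward older respondents.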
## Progressive disclosure

- `../references/automated-profiling.md` — ydata-profiling, Sweetviz, D-Tale
- `../references/visualization-patterns.md` — Matplotlib, Seaborn, Plotly patterns
- `../references/statistical-tests.md` — SciPy statistical tests guide
- `../references/large-dataset-eda.md` — Sampling, Polars, Dask approaches
## Related skills

- @data-science-feature-engineering — Next step after EDA
- @data-science-model-evaluation — Validate modeling assumptions
- @data-engineering-quality — Data validation frameworks
## References

- ydata-profiling Documentation
- pandas Visualization
- Seaborn Statistical Visualization
- SciPy Statistics