Exploratory Data Analysis (EDA)

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.

To install, copy this command and send it to your AI assistant: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-science-eda

Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.

When to use this skill

  • New dataset — need orientation on structure, types, distributions

  • Before feature engineering — understand variable relationships

  • Data quality investigation — find anomalies, missing patterns, outliers

  • Model preparation — validate assumptions about data

Core EDA workflow

  1. Profile structure
  • Schema, types, cardinality
  • Missing value patterns

  2. Analyze distributions
  • Numerical: histograms, boxplots, skewness
  • Categorical: frequencies, rare categories

  3. Explore relationships
  • Correlation matrix (numerical)
  • Cross-tabulations (categorical)
  • Target-variable relationships

  4. Identify issues
  • Outliers, duplicates, inconsistencies
  • Class imbalance (classification)
  • Temporal patterns (time series)
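
A minimal sketch of this four-step loop in Polars (the file path is a placeholder, and handing off to pandas for the full correlation matrix is one convenient choice, not a requirement of the skill):

import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path

# 1. Profile structure: schema, per-column cardinality and missingness
print(df.schema)
print(df.select(pl.all().n_unique()))
print(df.select(pl.all().null_count()))

# 2. Analyze distributions: summary stats plus skewness for numeric columns
print(df.describe())
num_cols = [c for c, dtype in df.schema.items() if dtype.is_numeric()]
print(df.select(pl.col(num_cols).skew()))

# 3. Explore relationships: pairwise correlations among numeric columns
print(df.select(num_cols).to_pandas().corr())

# 4. Identify issues: count exact duplicate rows
print(df.is_duplicated().sum())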

Quick tool selection

Task                    | Default choice                      | Notes
Automated profiling     | ydata-profiling / pandas-profiling  | Fast, comprehensive reports
Interactive exploration | ipywidgets + plotly                 | Drill-down capability
Statistical tests       | scipy.stats                         | Normality, correlations
Large datasets          | Polars (lazy API)                   | Memory-efficient
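
For the last row, a hedged sketch of the lazy approach in recent Polars: scan_parquet defers reading until collect(), so only the requested columns and aggregates are materialized (the path and the "amount" column are assumptions for illustration):

import polars as pl

lf = pl.scan_parquet("large_data.parquet")  # illustrative path
summary = lf.select(
    pl.len().alias("rows"),
    pl.col("amount").mean().alias("amount_mean"),        # "amount" is an assumed column
    pl.col("amount").null_count().alias("amount_nulls"),
).collect()
print(summary)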

Core implementation rules

  1. Start with automated profiling

import polars as pl
from ydata_profiling import ProfileReport

# Load with Polars, then hand a pandas frame to ydata-profiling,
# which expects pandas input.
df = pl.read_parquet("data.parquet")
profile = ProfileReport(df.to_pandas(), title="Data Profile")
profile.to_file("profile_report.html")
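
If the full report is slow on wide data, ProfileReport also accepts minimal=True, which skips the most expensive computations.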

  2. Focus on actionable insights
  • Document outliers worth investigating (not all outliers are problems)

  • Flag features with high cardinality or rare categories

  • Note strong correlations that may cause multicollinearity
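
As a hedged illustration of turning these flags into code, with arbitrary thresholds (50 unique values, |r| > 0.9) that should be tuned per dataset:

import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path

# Flag string columns with many distinct values (threshold is illustrative)
high_cardinality = [
    c for c in df.columns
    if df.schema[c] == pl.Utf8 and df[c].n_unique() > 50
]

# Flag strongly correlated numeric pairs as multicollinearity candidates
num_cols = [c for c, dtype in df.schema.items() if dtype.is_numeric()]
corr = df.select(num_cols).to_pandas().corr()
collinear = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(num_cols) for b in num_cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high_cardinality, collinear)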

  3. Visualize for communication
  • Distribution plots for key variables

  • Correlation heatmap

  • Missing value patterns

  • Target relationship plots
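
A sketch of these four plots with seaborn/matplotlib (the "amount" and "target" columns and the output paths are illustrative assumptions):

import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns

pdf = pl.read_parquet("data.parquet").to_pandas()  # placeholder path

# Distribution of a key variable
sns.histplot(pdf["amount"], kde=True)
plt.savefig("amount_distribution.png"); plt.close()

# Correlation heatmap over numeric columns
sns.heatmap(pdf.corr(numeric_only=True), cmap="coolwarm")
plt.savefig("correlation_heatmap.png"); plt.close()

# Missing value pattern: share of nulls per column
pdf.isna().mean().sort_values().plot.barh()
plt.savefig("missingness.png"); plt.close()

# Target relationship for a numeric feature
sns.boxplot(data=pdf, x="target", y="amount")
plt.savefig("target_vs_amount.png"); plt.close()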

  4. Validate assumptions
  • Check for expected ranges/business rules

  • Verify temporal consistency

  • Confirm key relationships match domain knowledge
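
For instance, counting rule violations rather than asserting, so one bad row does not halt the analysis (the dataset and business rules below are invented for illustration):

import polars as pl

df = pl.read_parquet("orders.parquet")  # illustrative dataset

violations = df.select(
    (pl.col("amount") < 0).sum().alias("negative_amounts"),
    (pl.col("ship_date") < pl.col("order_date")).sum().alias("ships_before_order"),
)
print(violations)

# Temporal consistency: the observed window should match expectations
print(df["order_date"].min(), df["order_date"].max())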

Common anti-patterns

  • ❌ Skipping EDA and jumping to modeling

  • ❌ Treating all outliers as errors

  • ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)

  • ❌ Over-plotting large datasets without sampling

  • ❌ Not documenting findings for the team
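
On the over-plotting point, downsampling before plotting is usually enough; a minimal sketch (the row cap and the "x"/"y" columns are illustrative assumptions):

import matplotlib.pyplot as plt
import polars as pl

df = pl.read_parquet("large_data.parquet")  # illustrative path
sample = df.sample(n=min(10_000, df.height), seed=42)  # plot a sample, not all rows
plt.scatter(sample["x"].to_numpy(), sample["y"].to_numpy(), s=2, alpha=0.3)
plt.savefig("scatter_sample.png")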

Progressive disclosure

  • ../references/automated-profiling.md — ydata-profiling, Sweetviz, D-Tale

  • ../references/visualization-patterns.md — Matplotlib, Seaborn, Plotly patterns

  • ../references/statistical-tests.md — Scipy statistical tests guide

  • ../references/large-dataset-eda.md — Sampling, Polars, Dask approaches

Related skills

  • @data-science-feature-engineering — Next step after EDA

  • @data-science-model-evaluation — Validate modeling assumptions

  • @data-engineering-quality — Data validation frameworks

References

  • ydata-profiling Documentation

  • Pandas Visualization

  • Seaborn Statistical Visualization

  • SciPy Statistics

