data-science-feature-engineering

Use this skill for creating, transforming, and selecting features that improve model performance.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-science-feature-engineering" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-science-feature-engineering

Feature Engineering

When to use this skill

  • After EDA — convert insights into features

  • Model underperforming — need better representations

  • Handling different data types (numerical, categorical, text, datetime)

  • Reducing dimensionality or selecting most predictive features

Feature engineering workflow

Numerical features

  • Scaling (StandardScaler, MinMaxScaler, RobustScaler)

  • Transformations (log, sqrt, Box-Cox for skewness)

  • Binning (equal-width, quantile, custom)

  • Interaction features
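As a minimal sketch of the numerical steps above, the following applies a log transform, outlier-robust scaling, and quantile binning to one skewed column. The `income` values are invented for illustration, and the choice of three bins is an assumption, not a recommendation.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, RobustScaler

# Hypothetical right-skewed feature with one extreme value.
income = np.array([[20_000.0], [35_000.0], [50_000.0], [80_000.0], [1_200_000.0]])

# log1p compresses the long right tail; it requires non-negative inputs.
income_log = np.log1p(income)

# RobustScaler centers on the median and scales by the interquartile range,
# so the extreme value has little influence on the scaling parameters.
income_scaled = RobustScaler().fit_transform(income_log)

# Quantile binning aims for a similar number of samples in each bin.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
income_bin = binner.fit_transform(income).ravel()

print(income_scaled.ravel())
print(income_bin)
```

In a real project these transformers would be fitted on the training split only, ideally inside a pipeline.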

Categorical features

  • One-hot encoding (low cardinality)

  • Target/Mean encoding (high cardinality)

  • Ordinal encoding (ordered categories)

  • Frequency/rare category handling
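Frequency encoding and rare-category grouping can be sketched with plain pandas. The `city` column, the toy rows, and the `min_share` threshold are all hypothetical; the point is that the frequencies are learned from the training frame and only looked up for the test frame.

```python
import pandas as pd

# Hypothetical train and test frames.
train = pd.DataFrame({"city": ["berlin", "berlin", "berlin", "paris", "paris", "oslo"]})
test = pd.DataFrame({"city": ["paris", "rome"]})

# Learn relative frequencies from the training data only.
freq = train["city"].value_counts(normalize=True)

# Frequency encoding: replace each category with its training share.
train["city_freq"] = train["city"].map(freq)
# Categories never seen in training get a frequency of 0.
test["city_freq"] = test["city"].map(freq).fillna(0.0)

# Rare-category handling: group categories below an assumed share threshold.
min_share = 0.2
rare = freq[freq < min_share].index
train["city_grouped"] = train["city"].where(~train["city"].isin(rare), "other")

print(train)
print(test)
```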

Datetime features

  • Extract components (year, month, day, hour, dayofweek)

  • Cyclical encoding (sin/cos for time cycles)

  • Time since/duration features
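The datetime patterns above might look like the following sketch. The `ts` column and the reference date are assumptions made for the example. The sin/cos pair places hour 23 next to hour 0, which a raw integer hour does not.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps.
df = pd.DataFrame(
    {"ts": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"])}
)

# Component extraction.
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek  # Monday is 0

# Cyclical encoding: map the 24-hour cycle onto the unit circle.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Time-since feature relative to an assumed reference date, in days.
reference = pd.Timestamp("2024-01-01")
df["days_since_ref"] = (df["ts"] - reference).dt.total_seconds() / 86400

print(df)
```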

Text features

  • TF-IDF, CountVectorizer

  • Embeddings (sentence-transformers)

  • Basic text stats (length, word count)
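A small sketch of the TF-IDF and basic-statistics options, using three made-up review snippets. Embeddings are left out here because they require downloading a pre-trained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents.
docs = [
    "late delivery and damaged box",
    "fast delivery",
    "great product, fast delivery",
]

# TF-IDF: sparse matrix with one row per document and one column per token.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)

# Basic text statistics as cheap additional features.
char_len = [len(d) for d in docs]
word_count = [len(d.split()) for d in docs]

print(X_tfidf.shape)
print(sorted(vectorizer.vocabulary_))
print(char_len, word_count)
```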

Feature selection

  • Filter methods (correlation, mutual information)

  • Wrapper methods (recursive feature elimination)

  • Embedded methods (L1 regularization, tree importance)
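The three selection families can be compared on synthetic data, as in the sketch below. The dataset comes from `make_classification`, and the target of three features and the `"median"` threshold are arbitrary choices for illustration. In practice the selector should be fitted on training folds only, for example as a pipeline step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter method: keep the 3 features with the highest mutual information.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=3).fit(X, y)
X_filter = filter_selector.transform(X)

# Wrapper method: recursively drop the weakest feature of a fitted model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
X_rfe = rfe.transform(X)

# Embedded method: keep features whose tree importance is at least the median.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
embedded_selector = SelectFromModel(forest, threshold="median").fit(X, y)
X_embedded = embedded_selector.transform(X)

print(X_filter.shape, X_rfe.shape, X_embedded.shape)
```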

Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| sklearn pipelines | sklearn.pipeline + ColumnTransformer | Reproducible, cross-validation safe |
| Categorical encoding | category_encoders | Beyond sklearn's limited options |
| Feature selection | sklearn.feature_selection | Mutual info, RFE, SelectFromModel |
| Text embeddings | sentence-transformers | Pre-trained semantic embeddings |
| Auto feature engineering | Feature-engine | Comprehensive transformations |

Core implementation rules

  1. Use pipelines to prevent leakage

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # numerical_features and categorical_features are lists of column names.
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ])

    # Wrapping the model means preprocessing is refit on each training fold.
    pipeline = Pipeline([
        ('prep', preprocessor),
        ('model', RandomForestClassifier()),
    ])
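To show why the pipeline is cross-validation safe, here is a self-contained sketch that passes such a pipeline to `cross_val_score`. The `age` and `plan` columns and the labels are invented toy data. During each fold the scaler and encoder are fitted on the training portion only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data.
X = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 38, 44, 26, 58, 33, 41],
    "plan": ["free", "pro", "pro", "free", "free", "pro",
             "free", "pro", "free", "pro", "free", "pro"],
})
y = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The whole pipeline is cloned and refitted per fold, so no statistics
# from a validation fold reach the preprocessing steps.
scores = cross_val_score(pipeline, X, y, cv=3)
print(scores)
```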

  2. Fit on train only, transform on all

    # Correct: fit_transform on train, transform on test
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)  # Only transform!
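A small check, using synthetic data and a plain `StandardScaler` as a stand-in for the preprocessor, illustrates what "fit on train only" means in practice: the learned mean matches the training split, and the test split is transformed with those same parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# The stored mean is the training mean, not the mean of the full dataset.
print(scaler.mean_, X_train.mean(axis=0), X.mean(axis=0))
```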

  3. Handle unknown categories

    OneHotEncoder(handle_unknown='ignore')  # Unknown → all zeros
    # or
    OneHotEncoder(handle_unknown='infrequent_if_exist')  # Unknown → infrequent group, if one exists (set min_frequency or max_categories)

  4. Document feature importance

Track which features were created, why, and their expected impact.
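One lightweight way to keep such a record is a small in-code registry. The `FeatureRecord` class and its fields are hypothetical, not part of any library, and the two entries are illustrative only.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """Hypothetical record describing one engineered feature."""
    name: str
    source_columns: tuple
    rationale: str
    expected_impact: str


registry = [
    FeatureRecord(
        name="hour_sin",
        source_columns=("ts",),
        rationale="Daily cycle; hour 23 should be close to hour 0.",
        expected_impact="Helps models capture time-of-day effects.",
    ),
    FeatureRecord(
        name="city_freq",
        source_columns=("city",),
        rationale="High-cardinality category; one-hot would be too wide.",
        expected_impact="Compact signal about how common a city is.",
    ),
]

# Print the registry as plain dictionaries, e.g. for export to a report.
for record in registry:
    print(asdict(record))
```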

Common anti-patterns

  • ❌ Fitting preprocessors on full dataset (leakage!)

  • ❌ One-hot encoding high-cardinality features (dimension explosion)

  • ❌ Ignoring feature scaling for distance-based models

  • ❌ Creating features without domain reasoning

  • ❌ Not validating feature distributions match between train/test
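For the last point, one possible check is a two-sample Kolmogorov-Smirnov test per numerical feature, sketched here with SciPy on synthetic data. The shifted test sample and the idea of comparing statistics are assumptions for the example; other drift checks would serve equally well.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic feature values: one test set from the same distribution as
# train, one shifted by a full standard deviation.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
test_same = rng.normal(loc=0.0, scale=1.0, size=1000)
test_shifted = rng.normal(loc=1.0, scale=1.0, size=1000)

same_result = ks_2samp(train_feature, test_same)
shifted_result = ks_2samp(train_feature, test_shifted)

# A large statistic with a small p-value suggests the distributions differ.
print("same:", same_result.statistic, same_result.pvalue)
print("shifted:", shifted_result.statistic, shifted_result.pvalue)
```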

Progressive disclosure

  • ../references/categorical-encoding.md — Comprehensive encoding guide

  • ../references/datetime-features.md — Time-based feature patterns

  • ../references/text-features.md — NLP feature engineering

  • ../references/feature-selection.md — Selection strategies and implementations

Related skills

  • @data-science-eda — Understand data before engineering

  • @data-science-model-evaluation — Validate feature impact

  • @data-engineering-core — Data processing fundamentals

References

  • sklearn Preprocessing

  • category_encoders

  • Feature-engine Documentation

  • Sentence Transformers

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
