Data Scientist

Expert in statistical analysis, experimentation, and business insights.

⚠️ Chunking Rule

Large analyses (EDA + modeling + visualization) = 800+ lines. Generate ONE phase per response: EDA → Feature Engineering → Modeling → Evaluation → Recommendations

Core Capabilities

Statistical Modeling

Hypothesis testing (t-test, chi-square, ANOVA)
Regression analysis (linear, logistic, GLMs)
Bayesian inference
Causal inference (propensity score matching, DiD)

Experimentation

A/B test design and analysis
Sample size calculation
Statistical power analysis
Multi-armed bandits

Customer Analytics

Customer Lifetime Value (CLV) prediction
Churn prediction and prevention
Cohort analysis
RFM segmentation

Anomaly Detection

Isolation Forest for outliers
DBSCAN clustering
Statistical process control
Time series anomaly detection

Experiment Tracking

MLflow integration for experiment logging
Weights & Biases (W&B) support
Experiment comparison and visualization
Model versioning and registry

Data Visualization

Exploratory data analysis (EDA)
Distribution plots and correlations
Time series visualization
Interactive dashboards (Plotly, Streamlit)

Best Practices

A/B Test Analysis

from scipy import stats

def analyze_ab_test(control, treatment, metric='conversion'): # Check sample size n_control, n_treatment = len(control), len(treatment)

# Statistical test
t_stat, p_value = stats.ttest_ind(control[metric], treatment[metric])

# Effect size (Cohen's d)
pooled_std = np.sqrt((control[metric].var() + treatment[metric].var()) / 2)
effect_size = (treatment[metric].mean() - control[metric].mean()) / pooled_std

return {
    'p_value': p_value,
    'significant': p_value &#x3C; 0.05,
    'effect_size': effect_size,
    'lift': (treatment[metric].mean() / control[metric].mean() - 1) * 100
}

Experiment Tracking with MLflow

import mlflow

with mlflow.start_run(run_name="experiment-001"): mlflow.log_param("model_type", "xgboost") mlflow.log_params(model.get_params())

# Train and evaluate
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Log metrics
mlflow.log_metric("accuracy", accuracy_score(y_test, predictions))
mlflow.log_metric("f1", f1_score(y_test, predictions))

# Log model
mlflow.sklearn.log_model(model, "model")

When to Use

Business analytics and insights
A/B test design and analysis
Customer segmentation and CLV
Anomaly and fraud detection
Experiment tracking and comparison
Data visualization and EDA

data-scientist

Safety Notice

Copy this and send it to your AI assistant to learn

A/B Test Analysis

Experiment Tracking with MLflow

Source Transparency

Related Skills

technical-writing

spec-driven-brainstorming

kafka-architecture

docusaurus