# Senior Data Scientist

Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.

## Keywords

data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow
## Quick Start

```bash
# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/

# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --target project/ --analyze

# Evaluate model performance
python scripts/model_evaluation_suite.py --config config.yaml --deploy

# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json
```
## Tools

| Script | Purpose |
| --- | --- |
| `scripts/experiment_designer.py` | A/B test design, power analysis, sample size calculation |
| `scripts/feature_engineering_pipeline.py` | Automated feature generation, correlation analysis, feature selection |
| `scripts/statistical_analyzer.py` | Hypothesis testing, causal inference, regression analysis |
| `scripts/model_evaluation_suite.py` | Model comparison, cross-validation, deployment readiness checks |
## Tech Stack

| Category | Tools |
| --- | --- |
| Languages | Python, SQL, R, Scala |
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| Data Processing | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS SageMaker, GCP Vertex AI |
| Experiment Tracking | MLflow, Weights & Biases |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |
## Workflow 1: Design and Analyze an A/B Test

1. **Define hypothesis** -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).

2. **Calculate sample size** -- `python scripts/experiment_designer.py --input data/ --output results/`

   - Specify the minimum detectable effect (MDE), significance level (alpha = 0.05), and power (0.80).
   - Example: for a 5% baseline conversion rate and a 10% relative-lift MDE, you need ~31,000 users per variant.

3. **Randomize assignment** -- Use hash-based assignment on user ID for deterministic, reproducible splits.

4. **Run experiment** -- Monitor for sample ratio mismatch (SRM) daily. Flag if the observed ratio deviates more than 1% from expected.

5. **Analyze results:**

   ```python
   # Two-proportion z-test for conversion rates.
   # Note: proportions_ztest lives in statsmodels, not scipy.
   from statsmodels.stats.proportion import proportions_ztest

   control_conv = control_successes / control_total
   treatment_conv = treatment_successes / treatment_total

   z_stat, p_value = proportions_ztest(
       [treatment_successes, control_successes],
       [treatment_total, control_total],
       alternative='two-sided',
   )
   # Reject H0 if p_value < 0.05
   ```

6. **Validate** -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.
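The hash-based randomization described above can be sketched with the standard library alone (a minimal sketch; the experiment-name salt and 50/50 split are illustrative assumptions, not part of `experiment_designer.py`):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministically map a user to a variant bucket.

    Hashing the user ID together with an experiment-specific salt gives a
    stable, reproducible split that is independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# The same user always lands in the same bucket for a given experiment
assert assign_variant("user-123", "checkout-test") == assign_variant("user-123", "checkout-test")
```

Because assignment depends only on the user ID and salt, the split can be recomputed at analysis time without storing an assignment table.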
## Workflow 2: Build a Feature Engineering Pipeline

1. **Profile raw data** -- `python scripts/feature_engineering_pipeline.py --target project/ --analyze`

   Identify null rates, cardinality, distributions, and data types.

2. **Generate candidate features:**

   - Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
   - Aggregation: rolling means/sums over 7d/30d/90d windows
   - Interaction: ratio features, polynomial combinations
   - Text: TF-IDF, embedding vectors

3. **Select features** -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance.

4. **Validate** -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.

5. **Register** -- Store features in a feature store with versioning and lineage metadata.
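The pairwise-correlation filter in the selection step can be sketched with plain NumPy (a sketch; the greedy keep-first strategy and function name are illustrative assumptions):

```python
import numpy as np

def drop_correlated(X: np.ndarray, threshold: float = 0.95) -> list:
    """Return indices of columns to keep, greedily dropping any column whose
    absolute correlation with an already-kept column exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.01, size=1000)   # near-duplicate of column 0
c = rng.normal(size=1000)                   # independent column
X = np.column_stack([a, b, c])
print(drop_correlated(X))  # column 1 is dropped as redundant with column 0
```

Greedy keep-first is order-dependent; ordering columns by importance (e.g., SHAP) before filtering keeps the stronger member of each correlated pair.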
## Workflow 3: Train and Evaluate a Model

1. **Split data** -- Stratified train/validation/test split (70/15/15). For time series, use a temporal split (no future leakage).

2. **Train baseline** -- Start with a simple model (logistic regression, gradient-boosted trees) to establish a benchmark.

3. **Tune hyperparameters** -- Use Optuna or cross-validated grid search. Log all runs to MLflow.

4. **Evaluate on the held-out test set:**

   ```python
   from sklearn.metrics import classification_report, roc_auc_score

   y_pred = model.predict(X_test)
   y_prob = model.predict_proba(X_test)[:, 1]

   print(classification_report(y_test, y_pred))
   print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
   ```

5. **Validate** -- Check calibration (predicted probabilities match observed rates). Evaluate fairness metrics across protected groups. Confirm no overfitting (train vs. test gap <5%).
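The calibration check in the validation step can be sketched as a bin-wise comparison of mean predicted probability against the observed positive rate (a sketch; the 10-bin layout is an assumption, and scikit-learn's `calibration_curve` offers the same check off the shelf):

```python
import numpy as np

def calibration_table(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    """For each probability bin, compare the mean predicted probability with
    the observed positive rate; large gaps indicate miscalibration."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b, y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# Perfectly calibrated synthetic data: P(y=1 | p) equals p by construction
rng = np.random.default_rng(42)
p = rng.uniform(size=50_000)
y = (rng.uniform(size=50_000) < p).astype(int)
for b, pred, obs, n in calibration_table(y, p):
    print(f"bin {b}: predicted={pred:.3f} observed={obs:.3f} n={n}")
```

On a well-calibrated model the predicted and observed columns track each other within sampling noise; a systematic gap suggests recalibration (Platt scaling or isotonic regression).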
## Workflow 4: Deploy a Model to Production

1. **Containerize** -- Package the model with its inference dependencies in Docker: `docker build -t model-service:v1 .`

2. **Set up serving** -- Deploy behind a REST API with health checks, input validation, and structured error responses.

3. **Configure monitoring:**

   - Input drift: compare incoming feature distributions to the training baseline (KS test, PSI)
   - Output drift: monitor prediction distribution shifts
   - Performance: track latency against P50/P95/P99 targets (<50ms / <100ms / <200ms)

4. **Enable canary deployment** -- Route 5% of traffic to the new model and compare metrics against the baseline for 24-48 hours.

5. **Validate** -- `python scripts/model_evaluation_suite.py --config config.yaml --deploy` confirms serving latency, an error rate <0.1%, and model outputs that match offline evaluation.
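The PSI drift check in the monitoring step can be sketched as follows (a sketch; the 10-bucket quantile binning and the common rule of thumb that PSI > 0.2 signals drift are assumptions):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a training-baseline sample and a
    live sample, using bins derived from the baseline's quantiles."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover the full real line
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)                     # avoid log(0)
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 100_000)
print(psi(baseline, rng.normal(0.0, 1, 100_000)))  # near 0: no drift
print(psi(baseline, rng.normal(0.5, 1, 100_000)))  # large: shifted mean
```

In production the baseline histogram would be frozen at training time and compared against a rolling window of incoming feature values.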
## Workflow 5: Perform Causal Inference

1. **Assess the assignment mechanism** -- Determine whether treatment was randomized (use experiment analysis) or observational (use the causal methods below).

2. **Select a method based on data structure:**

   - Propensity Score Matching: treatment is binary and many covariates are available
   - Difference-in-Differences: pre/post data are available for treatment and control groups
   - Regression Discontinuity: treatment is assigned by a threshold on a running variable
   - Instrumental Variables: unobserved confounding is present but a valid instrument exists

3. **Check assumptions** -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).

4. **Estimate the treatment effect** and compute confidence intervals.

5. **Validate** -- Run placebo tests (apply the method to the pre-treatment period and expect a null effect). Run sensitivity analysis for unobserved confounding.
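In the canonical 2x2 case, the difference-in-differences estimator reduces to a difference of group means (a minimal sketch on synthetic data; the trend and effect sizes are illustrative assumptions):

```python
import numpy as np

def did_estimate(t_pre, t_post, c_pre, c_post) -> float:
    """2x2 difference-in-differences: the treatment group's pre/post change
    minus the control group's, which nets out the shared time trend
    (valid under the parallel-trends assumption)."""
    return (np.mean(t_post) - np.mean(t_pre)) - (np.mean(c_post) - np.mean(c_pre))

rng = np.random.default_rng(7)
trend, effect = 1.0, 2.0
c_pre  = rng.normal(10.0,                 1, 5_000)
c_post = rng.normal(10.0 + trend,         1, 5_000)
t_pre  = rng.normal(12.0,                 1, 5_000)
t_post = rng.normal(12.0 + trend + effect, 1, 5_000)
print(did_estimate(t_pre, t_post, c_pre, c_post))  # recovers roughly 2.0
```

Note that the raw post-period gap between groups (roughly 3.0 here) would overstate the effect; differencing removes both the fixed group gap and the shared trend. A placebo run on two pre-treatment periods should return an estimate near zero.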
## Performance Targets

| Metric | Target |
| --- | --- |
| P50 latency | < 50ms |
| P95 latency | < 100ms |
| P99 latency | < 200ms |
| Throughput | > 1,000 req/s |
| Availability | 99.9% |
| Error rate | < 0.1% |
## Common Commands

```bash
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
```
## Reference Documentation

| Document | Path |
| --- | --- |
| Statistical Methods | `references/statistical_methods_advanced.md` |
| Experiment Design Frameworks | `references/experiment_design_frameworks.md` |
| Feature Engineering Patterns | `references/feature_engineering_patterns.md` |
| Automation Scripts | `scripts/` directory |