Statistical Modeling for Biomedical Data Analysis

Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features

Linear Regression - OLS for continuous outcomes with diagnostic tests
Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
Model Diagnostics - Assumption checking, fit statistics, residual analysis
Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.

When to Use

Apply this skill when user asks:

"What is the odds ratio of X associated with Y?"
"What is the hazard ratio for treatment?"
"Fit a linear regression of Y on X1, X2, X3"
"Perform ordinal logistic regression for severity outcome"
"What is the Kaplan-Meier survival estimate at time T?"
"What is the percentage reduction in odds ratio after adjusting for confounders?"
"Run a mixed-effects model with random intercepts"
"Compute the interaction term between A and B"
"What is the F-statistic from ANOVA comparing groups?"
"Test if gene/miRNA expression differs across cell types"

Model Selection Decision Tree

START: What type of outcome variable?
|
+-- CONTINUOUS (height, blood pressure, score)
|   +-- Independent observations -> Linear Regression (OLS)
|   +-- Repeated measures -> Mixed-Effects Model (LMM)
|   +-- Count data -> Poisson/Negative Binomial
|
+-- BINARY (yes/no, disease/healthy)
|   +-- Independent observations -> Logistic Regression
|   +-- Repeated measures -> Logistic Mixed-Effects (GLMM/GEE)
|   +-- Rare events -> Firth logistic regression
|
+-- ORDINAL (mild/moderate/severe, stages I/II/III/IV)
|   +-- Ordinal Logistic Regression (Proportional Odds)
|
+-- MULTINOMIAL (>2 unordered categories)
|   +-- Multinomial Logistic Regression
|
+-- TIME-TO-EVENT (survival time + censoring)
    +-- Regression -> Cox Proportional Hazards
    +-- Survival curves -> Kaplan-Meier

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, check for missing values.

CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
Examine available columns - List all columns in the dataset
Match question to data - Find the column that matches the described outcome
Verify outcome exists - Don't create outcome variables from predictors

Common mistake: Question mentions "obesity" -> Assumed outcome = BMI >= 30 (circular logic with BMI predictor). Always check data columns first: print(df.columns.tolist())

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(f"Observations: {len(df)}, Variables: {len(df.columns)}, Missing: {df.isnull().sum().sum()}")

for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")

Phase 1: Model Fitting

Goal: Fit appropriate model based on outcome type.

Use the decision tree above to select model type, then refer to the appropriate reference file for detailed code:

Linear Regression: references/linear_models.md
Logistic Regression (binary): references/logistic_regression.md
Ordinal Logistic: references/ordinal_logistic.md
Cox Proportional Hazards: references/cox_regression.md
ANOVA / Statistical Tests: anova_and_tests.md

Quick reference for key models:

import statsmodels.formula.api as smf
import numpy as np

# Linear regression
model = smf.ols('outcome ~ predictor1 + predictor2', data=df).fit()

# Logistic regression (odds ratios)
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)
ors = np.exp(model.params)
ci = np.exp(model.conf_int())

# Cox proportional hazards
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')
hr = cph.hazard_ratios_['treatment']

Phase 1b: ANOVA for Multi-Feature Data

When data has multiple features (genes, miRNAs, metabolites), use per-feature ANOVA (not aggregate). This is the most common pattern in genomics.

See anova_and_tests.md for the full decision tree, both methods, and worked examples.

Default for gene expression data: Per-feature ANOVA (Method B).

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

Key diagnostics by model type:

OLS: Shapiro-Wilk (normality), Breusch-Pagan (heteroscedasticity), VIF (multicollinearity)
Cox: Proportional hazards test via cph.check_assumptions()
Logistic: Hosmer-Lemeshow, ROC/AUC

See references/troubleshooting.md for diagnostic code and common issues.

Phase 3: Interpretation

Goal: Generate publication-quality summary.

For every result, report: effect size (OR/HR/coefficient), 95% CI, p-value, and model fit statistic. See bixbench_patterns_summary.md for common question-answer patterns.

Common BixBench Patterns

Pattern	Question Type	Key Steps
1	Odds ratio from ordinal regression	Fit OrderedModel, exp(coef)
2	Percentage reduction in OR	Compare crude vs adjusted model
3	Interaction effects	Fit `A * B`, extract `A:B` coef
4	Hazard ratio	Cox PH model, exp(coef)
5	Multi-feature ANOVA	Per-feature F-stats (not aggregate)

See bixbench_patterns_summary.md for solution code for each pattern. See references/bixbench_patterns.md for 15+ detailed question patterns.

Statsmodels vs Scikit-learn

Use Case	Library	Reason
Inference (p-values, CIs, ORs)	statsmodels	Full statistical output
Prediction (accuracy, AUC)	scikit-learn	Better prediction tools
Mixed-effects models	statsmodels	Only option
Regularization (LASSO, Ridge)	scikit-learn	Better optimization
Survival analysis	lifelines	Specialized library

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).

Python Package Requirements

statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0

Key Principles

Data-first approach - Always inspect and validate data before modeling
Model selection by outcome type - Use decision tree above
Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
Confounder awareness - Adjust for confounders when specified or clinically relevant
Reproducible analysis - All code must be deterministic and reproducible
Robust error handling - Graceful handling of convergence failures, separation, collinearity
Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

File Structure

tooluniverse-statistical-modeling/
+-- SKILL.md                          # This file (workflow guide)
+-- QUICK_START.md                    # 8 quick examples
+-- EXAMPLES.md                       # Legacy examples
+-- TOOLS_REFERENCE.md                # ToolUniverse tool catalog
+-- anova_and_tests.md                # ANOVA decision tree and code
+-- bixbench_patterns_summary.md      # Common BixBench solution patterns
+-- test_skill.py                     # Test suite
+-- references/
|   +-- logistic_regression.md        # Detailed logistic examples
|   +-- ordinal_logistic.md           # Ordinal logit guide
|   +-- cox_regression.md             # Survival analysis guide
|   +-- linear_models.md              # OLS and mixed-effects
|   +-- bixbench_patterns.md          # 15+ question patterns
|   +-- troubleshooting.md            # Diagnostic issues
+-- scripts/
    +-- format_statistical_output.py  # Format results for reporting
    +-- model_diagnostics.py          # Automated diagnostics

ToolUniverse Integration

While this skill is primarily computational, ToolUniverse tools can provide data:

Use Case	Tools
Clinical trial data	`clinical_trials_search`
Drug safety outcomes	`FAERS_calculate_disproportionality`
Gene-disease associations	`OpenTargets_target_disease_evidence`
Biomarker data	`fda_pharmacogenomic_biomarkers`

See TOOLS_REFERENCE.md for complete tool catalog.

References

statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
scikit-learn: https://scikit-learn.org/
Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel

Support

For detailed examples and troubleshooting:

Logistic regression: references/logistic_regression.md
Ordinal models: references/ordinal_logistic.md
Survival analysis: references/cox_regression.md
Linear/mixed models: references/linear_models.md
BixBench patterns: references/bixbench_patterns.md
ANOVA and tests: anova_and_tests.md
Diagnostics: references/troubleshooting.md

tooluniverse-statistical-modeling

Safety Notice

Copy this and send it to your AI assistant to learn