RNA-seq Differential Expression Analysis (DESeq2)
Differential expression analysis of RNA-seq count data using PyDESeq2, with enrichment analysis (gseapy) and gene annotation via ToolUniverse.
BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects.
Core Principles
- Data-first - Load and validate count data and metadata BEFORE any analysis
- Statistical rigor - Proper normalization, dispersion estimation, multiple testing correction
- Flexible design - Single-factor, multi-factor, and interaction designs
- Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean)
- Reproducible - Set random seeds, document all parameters
- Question-driven - Parse what the user is actually asking; extract the specific answer
- Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested
When to Use
- RNA-seq count matrices needing differential expression analysis
- DESeq2, DEGs, padj, log2FC questions
- Dispersion estimates or diagnostics
- GO, KEGG, Reactome enrichment on DEGs
- Specific gene expression changes between conditions
- Batch effect correction in RNA-seq
Required Packages
import pandas as pd, numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import gseapy as gp # enrichment (optional)
from tooluniverse import ToolUniverse # annotation (optional)
Analysis Workflow
Step 1: Parse the Question
Extract: data files, thresholds (padj/log2FC/baseMean), design factors, contrast, direction, enrichment type, specific genes. See question_parsing.md.
Step 2: Load & Validate Data
Load counts + metadata, ensure samples-as-rows/genes-as-columns, verify integer counts, align sample names, remove zero-count genes. See data_loading.md.
Step 2.5: Inspect Metadata (REQUIRED)
List ALL metadata columns and levels. Categorize as biological interest vs batch/block. Build design formula with covariates first, factor of interest last. See design_formula_guide.md.
Step 3: Run PyDESeq2
Set reference level via pd.Categorical, create DeseqDataSet, call dds.deseq2(), extract DeseqStats with contrast, run Wald test, optionally apply LFC shrinkage. See pydeseq2_workflow.md.
Tool boundaries:
- Python (PyDESeq2): ALL DESeq2 analysis
- ToolUniverse: ONLY gene annotation (ID conversion, pathway context)
- gseapy: Enrichment analysis (GO/KEGG/Reactome)
Step 4: Filter Results
Apply padj, log2FC, baseMean thresholds. Split by direction if needed. See result_filtering.md.
Step 5: Dispersion Analysis (if asked)
Key columns: genewise_dispersions, fitted_dispersions, MAP_dispersions, dispersions. See dispersion_analysis.md.
Step 6: Enrichment (optional)
Use gseapy enrich() with appropriate gene set library. See enrichment_analysis.md.
Step 7: Gene Annotation (optional)
Use ToolUniverse for ID conversion and gene context only. See output_formatting.md.
Common Patterns
| Pattern | Type | Key Operation |
|---|---|---|
| 1 | DEG count | len(results[(padj<0.05) & (abs(lfc)>0.5)]) |
| 2 | Gene value | results.loc['GENE', 'log2FoldChange'] |
| 3 | Direction | Filter log2FoldChange > 0 or < 0 |
| 4 | Set ops | degs_A - degs_B for unique DEGs |
| 5 | Dispersion | (dds.var['genewise_dispersions'] < thr).sum() |
See bixbench_examples.md for all 10 patterns with examples.
Error Quick Reference
| Error | Fix |
|---|---|
| No matching samples | Transpose counts; strip whitespace |
| Dispersion trend no converge | fit_type='mean' |
| Contrast not found | Check metadata['factor'].unique() |
| Non-integer counts | Round to int OR use t-test |
| NaN in padj | Independent filtering removed genes |
See troubleshooting.md for full debugging guide.
Known Limitations
- PyDESeq2 vs R DESeq2: Numerical differences exist for very low dispersion genes (<1e-05). For exact R reproducibility, use rpy2.
- gseapy vs R clusterProfiler: Results may differ. See r_clusterprofiler_guide.md.
Reference Files
- question_parsing.md - Extract parameters from questions
- data_loading.md - Data loading and validation
- design_formula_guide.md - Multi-factor design decision tree
- pydeseq2_workflow.md - Complete PyDESeq2 code examples
- result_filtering.md - Advanced filtering and extraction
- dispersion_analysis.md - Dispersion diagnostics
- enrichment_analysis.md - GO/KEGG/Reactome workflows
- output_formatting.md - Format answers correctly
- bixbench_examples.md - All 10 question patterns
- troubleshooting.md - Common issues and debugging
- r_clusterprofiler_guide.md - R clusterProfiler via rpy2
Utility Scripts
- format_deseq2_output.py - Output formatters
- load_count_matrix.py - Data loading utilities