Bulk RNA-seq differential expression with omicverse
Overview
Follow this skill to run the end-to-end differential expression (DEG) workflow showcased in t_deg.ipynb . It assumes the user provides a raw gene-level count matrix (e.g., from featureCounts) and wants to analyse bulk RNA-seq cohorts inside omicverse.
Instructions
-
Set up the session
-
Import omicverse as ov , scanpy as sc , and matplotlib.pyplot as plt .
-
Call ov.plot_set() so downstream plots adopt omicverse styling.
-
Prepare ID mapping assets
-
When gene IDs must be converted to gene symbols, instruct the user to download mapping pairs via ov.utils.download_geneid_annotation_pair() and store them under genesets/ .
-
Mention the available prebuilt genomes (T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, danRer11) and that users can generate their own mapping from GTF files if needed.
-
Load the raw counts
-
Read tab-delimited featureCounts output with ov.pd.read_csv(..., sep='\t', header=1, index_col=0) .
-
Strip trailing .bam segments from column names using list comprehension so sample IDs are clean.
-
Map gene identifiers
-
Run ov.bulk.Matrix_ID_mapping(counts_df, 'genesets/pair_<GENOME>.tsv') to replace gene_id entries with gene symbols.
-
Initialise the DEG object
-
Create dds = ov.bulk.pyDEG(mapped_counts) .
-
Handle duplicate gene symbols with dds.drop_duplicates_index() to keep the highest expressed version.
-
Normalise and estimate size factors
-
Execute dds.normalize() to calculate DESeq2 size factors, correcting for library size and batch differences.
-
Run differential testing
-
Collect treatment and control replicate labels into lists.
-
Call dds.deg_analysis(treatment_groups, control_groups, method='ttest') for the default Welch t-test.
-
Offer optional alternatives: method='edgepy' for edgeR-like tests and method='limma' for limma-style modelling.
-
Filter and threshold results
-
Note that lowly expressed genes are retained by default; filter using dds.result.loc[dds.result['log2(BaseMean)'] > 1] when needed.
-
Set dynamic fold-change and significance cutoffs via dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6) (fc_threshold=-1 auto-selects based on log2FC distribution).
-
Visualise differential expression
-
Produce volcano plots with dds.plot_volcano(title=..., figsize=..., plot_genes=... or plot_genes_num=...) to highlight key genes.
-
Generate per-gene boxplots using dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=..., legend_bbox=...) ; adjust y-axis tick labels if required.
-
Perform pathway enrichment (optional)
-
Download curated pathway libraries through ov.utils.download_pathway_database() .
-
Load genesets with ov.utils.geneset_prepare(<path>, organism='Mouse'|'Human'|...) .
-
Build the DEG gene list from dds.result.loc[dds.result['sig'] != 'normal'].index .
-
Run enrichment with ov.bulk.geneset_enrichment(gene_list=deg_genes, pathways_dict=..., pvalue_type='auto', organism=...) . Encourage users without internet access to provide a background gene list.
-
Visualise single-library results via ov.bulk.geneset_plot(...) and combine multiple ontologies using ov.bulk.geneset_plot_multi(enr_dict, colors_dict, num=...) .
-
Document outputs
-
Suggest exporting dds.result and enrichment tables to CSV for downstream reporting.
-
Encourage users to save figures generated by matplotlib (plt.savefig(...) ) when running outside notebooks.
-
Defensive validation
Before DEG: verify treatment/control groups exist as column names
all_cols = set(dds.result.columns) if hasattr(dds, 'result') else set(counts_df.columns) for g in treatment_groups + control_groups: assert g in all_cols, f"Sample '{g}' not found in count matrix columns"
Verify groups don't overlap
assert not set(treatment_groups) & set(control_groups), "Treatment and control groups must not overlap"
-
Troubleshooting tips
-
Ensure sample labels in treatment_groups /control_groups exactly match column names post-cleanup.
-
Verify required packages (omicverse , pyComplexHeatmap , gseapy ) are installed for enrichment visualisations.
-
Remind users that internet access is required the first time they download gene mappings or pathway databases.
Examples
-
"I have a featureCounts matrix for mouse tumour samples—normalize it with DESeq2, run t-test DEG, and highlight the top 8 genes in a volcano plot."
-
"Use omicverse to compute edgeR-style differential expression between treated and control replicates, then run GO enrichment on significant genes."
-
"Guide me through converting Ensembl IDs to symbols, performing limma DEG, and plotting boxplots for Krtap9-5 and Lef1."
References
-
Detailed walkthrough notebook: t_deg.ipynb
-
Sample count matrix for testing: sample/counts.txt
-
Quick copy/paste commands: reference.md