Bulk RNA-seq batch correction with ComBat

Overview

Apply this skill when a user has multiple bulk expression matrices measured across different batches and needs to harmonise them before downstream analysis. It follows t_bulk_combat.ipynb , w hich demonstrates the pyComBat workflow on ovarian cancer microarray cohorts.

Instructions

Import core libraries
Load omicverse as ov , anndata , pandas as pd , and matplotlib.pyplot as plt .
Call ov.ov_plot_set() (aliased ov.plot_set() in some releases) to align figures with omicverse styling.
Load each batch separately
Read the prepared pickled matrices (or user-provided expression tables) with pd.read_pickle(...) /pd.read_csv(...) .
Transpose to gene × sample before wrapping them in anndata.AnnData objects so adata.obs stores sample metadata.
Assign a batch column for every cohort (adata.obs['batch'] = '1' , '2' , ...). Encourage descriptive labels when availa ble.
Concatenate on shared genes
Use anndata.concat([adata1, adata2, adata3], merge='same') to retain the intersection of genes across batches.
Confirm the combined adata reports balanced sample counts per batch; if not, prompt users to re-check inputs.
Run ComBat batch correction
Execute ov.bulk.batch_correction(adata, batch_key='batch') .
Explain that corrected values are stored in adata.layers['batch_correction'] while the original counts remain in adata.X .
Export corrected and raw matrices
Obtain DataFrames via adata.to_df().T (raw) and adata.to_df(layer='batch_correction').T (corrected).
Encourage saving both tables (.to_csv(...) ) plus the harmonised AnnData (adata.write_h5ad('adata_batch.h5ad', compressio n='gzip') ).
Benchmark the correction
For per-sample variance checks, draw before/after boxplots and recolour boxes using ov.utils.red_color , blue_color , gree n_color palettes to match batches.
Copy raw counts to a named layer with adata.layers['raw'] = adata.X.copy() before PCA.
Run ov.pp.pca(adata, layer='raw', n_pcs=50) and ov.pp.pca(adata, layer='batch_correction', n_pcs=50) .
Visualise embeddings with ov.utils.embedding(..., basis='raw|original|X_pca', color='batch', frameon='small') and repeat fo r the corrected layer to verify mixing.
Defensive validation

Before ComBat: verify batch column exists and has >1 batch

assert 'batch' in adata.obs.columns, "adata.obs must contain a 'batch' column" n_batches = adata.obs['batch'].nunique() assert n_batches > 1, f"Only {n_batches} batch — need >1 for batch correction"

Verify gene overlap after concatenation

if adata.n_vars < 100: print(f"WARNING: Only {adata.n_vars} shared genes after concat — check gene ID harmonization")

Troubleshooting tips
Mismatched gene identifiers cause dropped features—remind users to harmonise feature names (e.g., gene symbols) before conca tenation.
pyComBat expects log-scale intensities or similarly distributed counts; recommend log-transforming strongly skewed matrices.
If batch_correction layer is missing, ensure the batch_key matches the column name in adata.obs .

Examples

"Combine three GEO ovarian cohorts, run ComBat, and export both the raw and corrected CSV matrices."
"Plot PCA embeddings before and after batch correction to confirm that batches 1–3 overlap."
"Save the harmonised AnnData file so I can reload it later for downstream DEG analysis."

References

Tutorial notebook: t_bulk_combat.ipynb
Example inputs: omicverse_guide/docs/Tutorials-bulk/data/combat/
Quick copy/paste commands: reference.md

bulk-rna-seq-batch-correction-with-combat

Safety Notice

Copy this and send it to your AI assistant to learn

Before ComBat: verify batch column exists and has >1 batch

Verify gene overlap after concatenation

Source Transparency

Related Skills

data-viz-plots

data-export-pdf

data-export-excel

single-cell-multi-omics-integration