Single-Cell Genomics and Expression Matrix Analysis
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When to Use This Skill
Apply when users:
- Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
- Ask about cell type identification, clustering, or annotation
- Need differential expression analysis by cell type or condition
- Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
- Ask about PCA, UMAP, t-SNE for expression data
- Need Leiden/Louvain clustering on expression matrices
- Want statistical comparisons between cell types (t-test, ANOVA, fold change)
- Ask about marker genes, batch correction, trajectory, or cell-cell communication
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
NOT for (use other skills instead):
- Bulk RNA-seq DESeq2 only ->
tooluniverse-rnaseq-deseq2 - Gene enrichment only ->
tooluniverse-gene-enrichment - VCF/variant analysis ->
tooluniverse-variant-analysis
Core Principles
- Data-first - Load, inspect, validate before analysis
- AnnData-centric - All data flows through anndata objects
- Cell type awareness - Per-cell-type subsetting when needed
- Statistical rigor - Normalization, multiple testing correction, effect sizes
- Question-driven - Parse what the user is actually asking
Required Packages
import scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp # enrichment
import harmonypy # batch correction (optional)
Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
Workflow Decision Tree
START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
| Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
| See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
| Most common BixBench pattern (bix-33)
| See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
| Pattern: Gene length vs expression (bix-22)
| See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
| See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
| See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
See: references/trajectory_analysis.md
Data format handling:
- h5ad ->
sc.read_h5ad() - 10X ->
sc.read_10x_mtx()orsc.read_10x_h5() - CSV/TSV ->
pd.read_csv()-> Convert to AnnData (check orientation!)
Data Loading
AnnData expects: cells/samples as rows (obs), genes as columns (var)
adata = sc.read_h5ad("data.h5ad") # h5ad already oriented
# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5: # genes > samples by 5x => transpose
df = df.T
adata = ad.AnnData(df)
# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
adata.obs[col] = meta.loc[common, col]
Quality Control
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_genes(adata, min_cells=3)
See: references/scanpy_workflow.md for details
Complete Pipeline (Quick Reference)
import scanpy as sc
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)
# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
Differential Expression Decision Tree
Single-Cell DE (many cells per condition):
Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
Best for: Per-cell-type DE, marker gene finding
Pseudo-Bulk DE (aggregate counts by sample):
Use: DESeq2 via PyDESeq2
Best for: Sample-level comparisons with replicates
Statistical Tests Only:
Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
Best for: Correlation, ANOVA, t-tests on summaries
Statistical Tests (Quick Reference)
from scipy import stats
from statsmodels.stats.multitest import multipletests
# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)
# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)
# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
Batch Correction (Harmony)
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
ToolUniverse Integration
Gene Annotation
- HPA_search_genes_by_query: Cell-type marker gene search
- MyGene_query_genes / MyGene_batch_query: Gene ID conversion
- ensembl_lookup_gene: Ensembl gene details
- UniProt_get_function_by_accession: Protein function
Cell-Cell Communication
- OmniPath_get_ligand_receptor_interactions: L-R pairs (CellPhoneDB, CellChatDB)
- OmniPath_get_signaling_interactions: Downstream signaling
- OmniPath_get_complexes: Multi-subunit receptors
Enrichment (Post-DE)
- PANTHER_enrichment: GO enrichment (BP, MF, CC)
- STRING_functional_enrichment: Network-based enrichment
- ReactomeAnalysis_pathway_enrichment: Reactome pathways
Scanpy vs Seurat Equivalents
| Operation | Seurat (R) | Scanpy (Python) |
|---|---|---|
| Load data | Read10X() | sc.read_10x_mtx() |
| Normalize | NormalizeData() | sc.pp.normalize_total() + sc.pp.log1p() |
| Find HVGs | FindVariableFeatures() | sc.pp.highly_variable_genes() |
| PCA | RunPCA() | sc.tl.pca() |
| Cluster | FindClusters() | sc.tl.leiden() |
| UMAP | RunUMAP() | sc.tl.umap() |
| Find markers | FindMarkers() | sc.tl.rank_genes_groups() |
| Batch correction | RunHarmony() | harmonypy.run_harmony() |
Troubleshooting
| Issue | Solution |
|---|---|
ModuleNotFoundError: leidenalg | pip install leidenalg |
| Sparse matrix errors | .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: valid = ~np.isnan(x) & ~np.isnan(y) |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use sc.pp.highly_variable_genes() to reduce features |
Reference Documentation
Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
- references/scanpy_workflow.md - Complete scanpy pipeline
- references/seurat_workflow.md - Seurat to Scanpy translation
- references/clustering_guide.md - Clustering methods
- references/marker_identification.md - Marker genes, annotation
- references/trajectory_analysis.md - Pseudotime
- references/cell_communication.md - OmniPath/CellPhoneDB workflow
- references/troubleshooting.md - Detailed error solutions