Gene Set Enrichment Analysis (GSEA)
Core Concept
GSEA uses all genes ranked by a statistic (log2FC, signed p-value) rather than a subset of significant genes. It finds gene sets where members are enriched at the top or bottom of the ranked list.
Prepare Ranked Gene List
library(clusterProfiler) library(org.Hs.eg.db)
de_results <- read.csv('de_results.csv')
Create named vector: values = statistic, names = gene IDs
gene_list <- de_results$log2FoldChange names(gene_list) <- de_results$gene_id
Sort in decreasing order (REQUIRED)
gene_list <- sort(gene_list, decreasing = TRUE)
Convert Gene IDs for GSEA
Convert symbols to Entrez IDs
gene_ids <- bitr(names(gene_list), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
Create ranked list with Entrez IDs
gene_list_entrez <- gene_list[names(gene_list) %in% gene_ids$SYMBOL] names(gene_list_entrez) <- gene_ids$ENTREZID[match(names(gene_list_entrez), gene_ids$SYMBOL)] gene_list_entrez <- sort(gene_list_entrez, decreasing = TRUE)
Alternative Ranking Statistics
Signed p-value (recommended for detecting both up and down)
gene_list <- -log10(de_results$pvalue) * sign(de_results$log2FoldChange) names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)
Wald statistic (from DESeq2)
gene_list <- de_results$stat names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)
GSEA with GO
gse_go <- gseGO( geneList = gene_list_entrez, OrgDb = org.Hs.eg.db, ont = 'BP', # BP, MF, CC, or ALL minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE, pAdjustMethod = 'BH' )
Make readable
gse_go <- setReadable(gse_go, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with KEGG
gse_kegg <- gseKEGG( geneList = gene_list_entrez, organism = 'hsa', minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE )
Make readable
gse_kegg <- setReadable(gse_kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with Custom Gene Sets
Read GMT file (Gene Matrix Transposed)
gene_sets <- read.gmt('msigdb_hallmarks.gmt')
gse_custom <- GSEA( geneList = gene_list_entrez, TERM2GENE = gene_sets, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05 )
MSigDB Gene Sets
Use msigdbr package for MSigDB gene sets
library(msigdbr)
Hallmark gene sets
hallmarks <- msigdbr(species = 'Homo sapiens', category = 'H') hallmarks_t2g <- hallmarks[, c('gs_name', 'entrez_gene')]
gse_hallmark <- GSEA( geneList = gene_list_entrez, TERM2GENE = hallmarks_t2g, pvalueCutoff = 0.05 )
Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)
Understanding Results
View results
head(gse_go) results <- as.data.frame(gse_go)
Key columns:
- NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated)
- pvalue: Nominal p-value
- p.adjust: FDR-adjusted p-value
- core_enrichment: Leading edge genes
Interpreting NES (Normalized Enrichment Score)
NES Interpretation
Positive (> 0) Gene set enriched in upregulated genes
Negative (< 0) Gene set enriched in downregulated genes
NES
Key Parameters
Parameter Default Description
geneList required Named, sorted numeric vector
OrgDb required Organism database (for gseGO)
organism hsa KEGG organism code (for gseKEGG)
ont BP Ontology: BP, MF, CC, ALL
minGSSize 10 Min genes in gene set
maxGSSize 500 Max genes in gene set
pvalueCutoff 0.05 P-value threshold
pAdjustMethod BH Adjustment method
nPerm 10000 Permutations (if permutation test used)
eps 1e-10 Boundary for p-value calculation
Export Results
results_df <- as.data.frame(gse_go) write.csv(results_df, 'gsea_go_results.csv', row.names = FALSE)
Get leading edge genes for a term
leading_edge <- strsplit(results_df$core_enrichment[1], '/')[[1]]
Notes
-
Must be sorted - gene list must be sorted in decreasing order
-
Named vector - names are gene IDs, values are statistics
-
No arbitrary cutoffs - uses all genes, not just significant ones
-
NES sign matters - positive = upregulated enrichment
-
Leading edge - core_enrichment contains driving genes
Related Skills
-
go-enrichment - Over-representation analysis for GO
-
kegg-pathways - Over-representation analysis for KEGG
-
enrichment-visualization - GSEA plots, ridge plots