edgeR Basics

Differential expression analysis using edgeR's quasi-likelihood framework for RNA-seq count data.

Required Libraries

library(edgeR) library(limma) # For design matrices and voom

Installation

if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager') BiocManager::install('edgeR')

Creating DGEList Object

From count matrix

counts: matrix with genes as rows, samples as columns

group: factor indicating sample groups

y <- DGEList(counts = counts, group = group)

With gene annotation

y <- DGEList(counts = counts, group = group, genes = gene_info)

Check structure

Standard edgeR Workflow (Quasi-Likelihood)

Create DGEList

y <- DGEList(counts = counts, group = group)

Filter low-expression genes

keep <- filterByExpr(y, group = group) y <- y[keep, , keep.lib.sizes = FALSE]

Normalize (TMM by default)

y <- calcNormFactors(y)

Create design matrix

design <- model.matrix(~ group)

Estimate dispersion (optional in edgeR v4+ but improves BCV plots)

y <- estimateDisp(y, design)

Fit quasi-likelihood model

fit <- glmQLFit(y, design)

Perform quasi-likelihood F-test

qlf <- glmQLFTest(fit, coef = 2)

View top genes

topTags(qlf)

Filtering Low-Expression Genes

Automatic filtering (recommended)

keep <- filterByExpr(y, group = group) y <- y[keep, , keep.lib.sizes = FALSE]

Manual filtering: CPM threshold

keep <- rowSums(cpm(y) > 1) >= 2 # At least 2 samples with CPM > 1 y <- y[keep, , keep.lib.sizes = FALSE]

Filter by minimum counts

keep <- rowSums(y$counts >= 10) >= 3 # At least 3 samples with 10+ counts y <- y[keep, , keep.lib.sizes = FALSE]

Normalization Methods

TMM normalization (default, recommended)

y <- calcNormFactors(y, method = 'TMM')

Alternative methods

y <- calcNormFactors(y, method = 'RLE') # Relative Log Expression y <- calcNormFactors(y, method = 'upperquartile') y <- calcNormFactors(y, method = 'none') # No normalization

View normalization factors

y$samples$norm.factors

Design Matrices

Simple two-group comparison

design <- model.matrix(~ group)

With batch correction

design <- model.matrix(~ batch + group)

Interaction model

design <- model.matrix(~ genotype + treatment + genotype:treatment)

No intercept (for direct group comparisons)

design <- model.matrix(~ 0 + group) colnames(design) <- levels(group)

Dispersion Estimation

Estimate all dispersions

y <- estimateDisp(y, design)

Or estimate separately

y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design)

View dispersions

y$common.dispersion y$trended.dispersion y$tagwise.dispersion

Plot BCV (biological coefficient of variation)

plotBCV(y)

Quasi-Likelihood Testing

Fit QL model

fit <- glmQLFit(y, design)

Test specific coefficient

qlf <- glmQLFTest(fit, coef = 2)

Test with contrast

contrast <- makeContrasts(groupB - groupA, levels = design) qlf <- glmQLFTest(fit, contrast = contrast)

Test multiple coefficients (ANOVA-like)

qlf <- glmQLFTest(fit, coef = 2:3)

Making Contrasts

Design without intercept

design <- model.matrix(~ 0 + group) colnames(design) <- levels(group) y <- estimateDisp(y, design) fit <- glmQLFit(y, design)

Pairwise comparisons

contrast <- makeContrasts( TreatedVsControl = treated - control, DrugAVsControl = drugA - control, DrugBVsControl = drugB - control, DrugAVsDrugB = drugA - drugB, levels = design )

Test each contrast

qlf_treated <- glmQLFTest(fit, contrast = contrast[, 'TreatedVsControl']) qlf_drugA <- glmQLFTest(fit, contrast = contrast[, 'DrugAVsControl'])

Accessing Results

Top differentially expressed genes

topTags(qlf, n = 20)

All results as data frame

results <- topTags(qlf, n = Inf)$table

Summary of DE genes at different thresholds

summary(decideTests(qlf))

Get DE genes with specific cutoffs

de_genes <- topTags(qlf, n = Inf, p.value = 0.05)$table

Result Columns

Column Description

logFC

Log2 fold change

logCPM

Average log2 counts per million

Quasi-likelihood F-statistic

PValue

Raw p-value

FDR

False discovery rate (adjusted p-value)

Alternative: Exact Test (Classic edgeR)

For simple two-group comparison only

y <- DGEList(counts = counts, group = group) y <- calcNormFactors(y) y <- estimateDisp(y)

Exact test

et <- exactTest(y) topTags(et)

Alternative: glmLRT (Likelihood Ratio Test)

Fit GLM

fit <- glmFit(y, design)

Likelihood ratio test

lrt <- glmLRT(fit, coef = 2) topTags(lrt)

Treat Test (Log Fold Change Threshold)

Test for |logFC| > threshold

tr <- glmTreat(fit, coef = 2, lfc = log2(1.5)) # |FC| > 1.5 topTags(tr)

Multi-Factor Designs

Design with batch correction

design <- model.matrix(~ batch + condition, data = sample_info) y <- estimateDisp(y, design) fit <- glmQLFit(y, design)

Test condition effect (controlling for batch)

Condition coefficient is typically the last

qlf <- glmQLFTest(fit, coef = ncol(design))

Getting Normalized Counts

Counts per million (CPM)

cpm_values <- cpm(y)

Log2 CPM

log_cpm <- cpm(y, log = TRUE)

RPM (reads per million, same as CPM)

rpm_values <- cpm(y)

With prior count for log transformation

log_cpm <- cpm(y, log = TRUE, prior.count = 2)

Exporting Results

Get all results

all_results <- topTags(qlf, n = Inf)$table

Add gene IDs as column

all_results$gene_id <- rownames(all_results)

Write to file

write.csv(all_results, file = 'edger_results.csv', row.names = FALSE)

Export normalized counts

write.csv(cpm(y), file = 'cpm_values.csv')

Common Errors

Error Cause Solution

"design matrix not full rank" Confounded variables Check sample metadata

"No residual df" Too few samples Need more replicates

"NA/NaN/Inf" Zero counts in all samples Filter more stringently

Deprecated/Changed Functions

Old Status New

decidetestsDGE()

Removed (v4.4) decideTests()

glmFit()

glmLRT()

Still works Prefer glmQLFit()

glmQLFTest()

estimateDisp()

Optional (v4+) glmQLFit() estimates internally

mglmLS() , mglmSimple()

Retired mglmLevenberg() or mglmOneWay()

Note: calcNormFactors() and normLibSizes() are synonyms - both work.

Quick Reference: Workflow Steps

1. Create DGEList

y <- DGEList(counts = counts, group = group)

2. Filter low-expression genes

keep <- filterByExpr(y, group = group) y <- y[keep, , keep.lib.sizes = FALSE]

3. Normalize

y <- calcNormFactors(y)

4. Create design matrix

design <- model.matrix(~ group)

5. Estimate dispersion (optional in v4+)

y <- estimateDisp(y, design)

6. Fit quasi-likelihood model

fit <- glmQLFit(y, design)

7. Test for DE

qlf <- glmQLFTest(fit, coef = 2)

8. Get results

topTags(qlf, n = 20)

Choosing edgeR vs DESeq2

Aspect edgeR DESeq2

Model Negative binomial + QL Negative binomial

Shrinkage Empirical Bayes on dispersions Shrinkage estimators for LFC

Small samples Robust with QL framework Good with shrinkage

Speed Generally faster Slower for large datasets

Output F-statistic, FDR Wald statistic, padj

Related Skills

deseq2-basics - Alternative DE analysis with DESeq2
de-visualization - MA plots, volcano plots, heatmaps
de-results - Extract and export significant genes

bio-de-edger-basics

Safety Notice

Copy this and send it to your AI assistant to learn

From count matrix

counts: matrix with genes as rows, samples as columns

group: factor indicating sample groups

With gene annotation

Check structure

Create DGEList

Filter low-expression genes

Normalize (TMM by default)

Create design matrix

Estimate dispersion (optional in edgeR v4+ but improves BCV plots)

Fit quasi-likelihood model

Perform quasi-likelihood F-test

View top genes

Automatic filtering (recommended)

Manual filtering: CPM threshold

Filter by minimum counts

TMM normalization (default, recommended)

Alternative methods

View normalization factors

Simple two-group comparison

With batch correction

Interaction model

No intercept (for direct group comparisons)

Estimate all dispersions

Or estimate separately

View dispersions

Plot BCV (biological coefficient of variation)

Fit QL model

Test specific coefficient

Test with contrast

Test multiple coefficients (ANOVA-like)

Design without intercept

Pairwise comparisons

Test each contrast

Top differentially expressed genes

All results as data frame

Summary of DE genes at different thresholds

Get DE genes with specific cutoffs

For simple two-group comparison only

Exact test

Fit GLM

Likelihood ratio test

Test for |logFC| > threshold

Design with batch correction

Test condition effect (controlling for batch)

Condition coefficient is typically the last

Counts per million (CPM)

Log2 CPM

RPM (reads per million, same as CPM)

With prior count for log transformation

Get all results

Add gene IDs as column

Write to file

Export normalized counts

1. Create DGEList

2. Filter low-expression genes

3. Normalize

4. Create design matrix

5. Estimate dispersion (optional in v4+)

6. Fit quasi-likelihood model

7. Test for DE

8. Get results

Source Transparency

Related Skills

bio-microbiome-diversity-analysis

bio-proteomics-dia-analysis

bio-metabolomics-statistical-analysis

bio-restriction-fragment-analysis