RNA-Seq Differential Expression with DESeq2 — Complete Guide

? What Is Differential Expression?

When you compare two biological conditions (e.g., disease vs. healthy, treated vs. untreated), some genes will be more active (upregulated) or less active (downregulated) in one condition versus the other. Differential expression (DE) analysis is the statistical framework that identifies these genes from RNA-Seq count data.

The core challenge: raw read counts are affected by sequencing depth, gene length, and biological variability. A robust DE tool must account for all three. DESeq2 handles this through a negative binomial generalized linear model with shrinkage estimators for dispersion and fold change.

Key Insight: DESeq2 works on raw integer counts, not FPKM or TPM. It performs its own internal normalization (median-of-ratios). Feeding it pre-normalized data is a common mistake that inflates false positives.

How DESeq2 Works (Simplified Pipeline)

Size factor estimation — normalizes for sequencing depth differences across samples using median-of-ratios.
Dispersion estimation — models biological variability per gene, then shrinks gene-level estimates toward a fitted trend (borrowing strength across genes).
Generalized linear model fitting — fits a negative binomial GLM per gene using the design formula.
Wald test or LRT — tests for significance of coefficients (fold changes).
Independent filtering & BH adjustment — filters low-count genes, then adjusts p-values for multiple testing using Benjamini-Hochberg.

0 Prerequisites & Installation

You need R ≥ 4.3 and optionally RStudio. DESeq2 is installed from Bioconductor, not CRAN.

R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

# Additional packages used in this tutorial
install.packages(c("tidyverse", "pheatmap", "RColorBrewer", "ggrepel"))
BiocManager::install("EnhancedVolcano")

Pro Tip: Use BiocManager::version() to verify your Bioconductor release matches your R version. Mismatches cause cryptic installation failures. Check the Bioconductor install page for the compatibility matrix.

1 Load Your Count Matrix

Your input is a genes × samples matrix of raw integer counts. This typically comes from a read aligner (STAR, HISAT2) followed by a counter (featureCounts, HTSeq). Many modern pipelines like nf-core/rnaseq produce this matrix automatically.

R

library(DESeq2)
library(tidyverse)

# Read raw count matrix (rows = genes, columns = samples)
count_matrix <- read.delim("counts.txt", row.names = 1) %>%
  as.matrix()

# Quick sanity checks
dim(count_matrix)          # Expect: num_genes x num_samples
head(count_matrix[, 1:6])  # Peek at first rows/columns
stopifnot(is.integer(count_matrix[1,1]) || all(count_matrix == floor(count_matrix)))

# Build sample metadata (coldata)
# Rows of coldata must match columns of count_matrix in the same order
sample_info <- data.frame(
  row.names = colnames(count_matrix),
  condition = factor(c(rep("Control", 3), rep("Treated", 3))),
  batch     = factor(c("A","A","B","A","B","B"))
)

# CRITICAL: set the reference level for your comparison
sample_info$condition <- relevel(sample_info$condition, ref = "Control")
print(sample_info)

Common Pitfall: If your count matrix has decimal values, something went wrong upstream — either you loaded TPM/FPKM data, or a tool performed partial normalization. Go back and get the raw integer counts. DESeq2 will throw a warning or silently round, but your results will not be trustworthy.

Pre-filtering (Optional but Recommended)

Removing genes with very few reads across all samples speeds up computation and reduces the severity of the multiple testing correction.

R

# Keep genes where at least 3 samples have 10+ counts
keep <- rowSums(count_matrix >= 10) >= 3
count_matrix <- count_matrix[keep, ]
cat("Genes retained after pre-filtering:", nrow(count_matrix), "\n")

How to Choose a Pre-Filter Cutoff

Rule of thumb: require at least n samples (size of your smallest group) to have ≥ 10 counts.
Why 10? Genes below this contribute almost no statistical information — DESeq2's independent filtering handles them anyway, but pre-filtering reduces the object size and speeds up DESeq().
Stricter filtering (e.g., ≥ 20 counts in 50% of samples) is fine for discovery-focused analyses. More lenient filtering (≥ 5 counts) is appropriate for rare transcripts.

2 Build the DESeqDataSet Object

The DESeqDataSet bundles your counts, sample metadata, and design formula into one object.

R

dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData   = sample_info,
  design    = ~ batch + condition   # batch as covariate, condition as variable of interest
)
dds

Parameter	What It Does	How to Pick
`countData`	Raw integer count matrix	Genes × Samples. Must be integers.
`colData`	Sample metadata data.frame	Row names must exactly match column names of `countData`.
`design`	Formula describing the experimental design	Put confounders (batch) first, variable of interest (condition) last. The last variable is what `results()` tests by default.

Design Formula Tips:

~ condition — simplest two-group comparison
~ batch + condition — adjust for known batch effects
~ genotype + treatment + genotype:treatment — interaction model (e.g., does treatment effect differ by genotype?)
If you have paired samples (same patient, before/after), use ~ patient + condition

3 Run the DESeq2 Pipeline

A single function call runs the entire statistical pipeline: size factor estimation, dispersion estimation, and model fitting.

R

dds <- DESeq(dds)

# Inspect size factors (normalization)
sizeFactors(dds)

Grab a coffee! For large datasets (>30,000 genes, >50 samples), DESeq() may take 5–15 minutes. If you need speed, pass parallel = TRUE and register a BiocParallel backend first.

Understanding Size Factors

Each sample receives a size factor that corrects for sequencing depth. A sample with size factor 1.2 had ~20% more sequencing depth than the "average" sample. Extreme size factors (e.g., 0.3 or 3.0) may indicate a problematic library.

4 Explore Dispersion Estimates

The dispersion plot is your first quality checkpoint. It shows how biological variability relates to mean expression.

R

plotDispEsts(dds, main = "Dispersion Estimates")

Reading the Dispersion Plot:

Black dots = per-gene maximum likelihood estimates (noisy)
Red line = fitted trend (the model's expectation)
Blue dots = shrunken (final) estimates pulled toward the trend
Healthy sign: cloud of black dots around the red line that decreases with higher mean expression
Red flag: trend line going up, massive scatter, or all dispersions clustering unrealistically low — indicates batch effects or sample mislabeling

5 Extract & Shrink Results

R

# Extract results for the condition comparison
res <- results(dds,
               contrast = c("condition", "Treated", "Control"),
               alpha    = 0.05)      # FDR threshold for independent filtering

# Log2 fold change shrinkage (recommended for ranking & visualization)
res_shrunk <- lfcShrink(dds,
                         coef  = "condition_Treated_vs_Control",
                         type  = "apeglm")   # apeglm is the recommended shrinkage method

summary(res_shrunk)

Parameter	Purpose	Recommendation
`contrast`	Specifies which comparison to extract	`c("variable", "numerator", "denominator")`. Positive LFC = higher in numerator.
`alpha`	FDR cutoff used by independent filtering	Standard: `0.05`. Use `0.1` for exploratory; `0.01` for high-confidence.
`lfcShrink type`	Shrinkage estimator	`"apeglm"` (default recommendation). `"ashr"` allows zero-centered priors. `"normal"` is the legacy method.

Why Shrink Fold Changes?

Genes with very low counts can show extreme fold changes (e.g., 1 read in control, 3 reads in treated = log2FC of 1.58). These are unreliable. LFC shrinkage pulls noisy estimates toward zero, giving more trustworthy rankings. Always use shrunken LFCs for ranking and visualization; use un-shrunken results for significance testing.

R

# Convert to tidy data frame and save
res_df <- as.data.frame(res_shrunk) %>%
  rownames_to_column("gene") %>%
  arrange(padj)

write.csv(res_df, "deseq2_results.csv", row.names = FALSE)
head(res_df)

How to Choose Significance Thresholds

This is one of the most misunderstood aspects of DE analysis. There is no universal correct cutoff — it depends on your experimental goals.

Threshold Decision Framework

Discovery / Exploratory

padj < 0.1
|log2FC| > 0.5
Casts a wide net
More false positives, fewer missed genes
Good for pathway/enrichment input

Standard Publication

padj < 0.05
|log2FC| > 1.0
Balanced approach
Most commonly used in literature
Good default if unsure

High-Confidence / Validation

padj < 0.01
|log2FC| > 2.0
Very stringent
Minimal false positives
Good for qPCR validation targets

Important Nuance: The padj cutoff controls the false discovery rate — at padj < 0.05, you expect ~5% of your "significant" genes to be false positives. The log2FC cutoff is about biological relevance — statistically significant but tiny changes (e.g., log2FC = 0.1) may not matter biologically. Always apply both thresholds together.

6 Visualize Your Results

PCA Plot (Quality Check)

PCA reveals global patterns and outliers. Samples in the same condition should cluster together.

R

# Variance-stabilizing transformation (for visualization only, not DE testing)
vsd <- vst(dds, blind = FALSE)

# PCA colored by condition
plotPCA(vsd, intgroup = "condition") +
  theme_minimal(base_size = 14) +
  ggtitle("PCA of Variance-Stabilized Counts")

Troubleshooting PCA: If samples don't cluster by your condition of interest but do cluster by batch, sequencing lane, or extraction date — you have a batch effect. Add that variable to your design formula (~ batch + condition). If samples are total outliers, consider removing them and re-running.

Volcano Plot

The most informative single plot for DE results: statistical significance (y-axis) vs. effect size (x-axis).

R

library(EnhancedVolcano)

EnhancedVolcano(res_shrunk,
  lab         = rownames(res_shrunk),
  x           = 'log2FoldChange',
  y           = 'padj',
  title       = 'Treated vs Control',
  subtitle    = 'DESeq2 with apeglm shrinkage',
  pCutoff     = 0.05,
  FCcutoff    = 1.0,
  pointSize   = 2.0,
  labSize     = 3.5,
  legendPosition = 'right',
  drawConnectors = TRUE,
  widthConnectors = 0.5)

MA Plot

R

plotMA(res_shrunk, ylim = c(-4, 4), main = "MA Plot (apeglm shrunken LFC)")

Sample-to-Sample Heatmap

R

library(pheatmap)

# Euclidean distances between samples
sample_dists <- dist(t(assay(vsd)))
pheatmap(as.matrix(sample_dists),
         clustering_distance_rows = sample_dists,
         clustering_distance_cols = sample_dists,
         color = colorRampPalette(c("navy","white","firebrick3"))(50),
         main = "Sample Distance Heatmap")

Top DE Genes Heatmap

R

# Top 30 DE genes by adjusted p-value
top_genes <- head(res_df$gene, 30)
mat <- assay(vsd)[top_genes, ]
mat <- mat - rowMeans(mat)   # Center by row mean

pheatmap(mat,
         scale           = "none",
         cluster_rows    = TRUE,
         cluster_cols    = TRUE,
         show_rownames   = TRUE,
         annotation_col  = sample_info,
         color           = colorRampPalette(c("#2166AC","white","#B2182B"))(100),
         main            = "Top 30 Differentially Expressed Genes")

7 Downstream: Functional Enrichment

Once you have your DE gene list, the next step is understanding what biological processes or pathways are enriched.

R

# Using clusterProfiler for GO enrichment
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")  # Change for your organism

library(clusterProfiler)
library(org.Hs.eg.db)

# Get significant upregulated genes
sig_genes <- res_df %>%
  filter(padj < 0.05, log2FoldChange > 1) %>%
  pull(gene)

# GO Biological Process enrichment
ego <- enrichGO(gene     = sig_genes,
                OrgDb    = org.Hs.eg.db,
                keyType  = "SYMBOL",
                ont      = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05)

dotplot(ego, showCategory = 15, title = "GO Biological Process Enrichment")

Beyond GO: Consider running GSEA (Gene Set Enrichment Analysis) on the full ranked gene list rather than just the significant genes. GSEA can detect coordinated pathway-level changes even when individual genes don't pass the significance cutoff. Use fgsea or clusterProfiler::gseGO().

DESeq2 vs. edgeR vs. limma-voom

All three are well-validated and widely published. Here's when to prefer each:

Tool	Strengths	Best When
DESeq2	Conservative, built-in shrinkage, great documentation	Small sample sizes (3–5 per group); default choice
edgeR	Fast, flexible, quasi-likelihood F-test	Complex designs; larger sample sizes; speed matters
limma-voom	Extremely fast, handles large designs elegantly	Very large datasets (hundreds of samples); microarray background

Practical Advice: For a typical RNA-Seq experiment with 3–6 replicates per group, DESeq2 and edgeR give very similar results. If you're publishing, running both and reporting the overlap (intersection) of DE genes is a strong validation strategy that reviewers appreciate.

Summary

What You Learned:

Load raw counts and build a properly designed DESeqDataSet
Run the full DESeq2 pipeline in a single function call
Interpret dispersion plots as a quality checkpoint
Extract results with appropriate shrinkage (apeglm)
Choose padj and log2FC thresholds based on your goals
Create PCA, volcano, MA, and heatmap visualizations
Perform functional enrichment on your DE gene list
Compare DESeq2 with edgeR and limma-voom

All Tutorials Next: WGCNA Networks

RNA-Seq Differential Expression Analysis

On this page