RNA-Seq Differential Expression Analysis

A from-scratch guide to identifying differentially expressed genes using DESeq2 in R — with every parameter explained and every threshold justified.

~45 min Intermediate R / RStudio
Byte says Let's Learn

? What Is Differential Expression?

When you compare two biological conditions (e.g., disease vs. healthy, treated vs. untreated), some genes will be more active (upregulated) or less active (downregulated) in one condition versus the other. Differential expression (DE) analysis is the statistical framework that identifies these genes from RNA-Seq count data.

The core challenge: raw read counts are affected by sequencing depth, gene length, and biological variability. A robust DE tool must account for all three. DESeq2 handles this through a negative binomial generalized linear model with shrinkage estimators for dispersion and fold change.

Byte key point
Key Insight: DESeq2 works on raw integer counts, not FPKM or TPM. It performs its own internal normalization (median-of-ratios). Feeding it pre-normalized data is a common mistake that inflates false positives.
How DESeq2 Works (Simplified Pipeline)
  1. Size factor estimation — normalizes for sequencing depth differences across samples using median-of-ratios.
  2. Dispersion estimation — models biological variability per gene, then shrinks gene-level estimates toward a fitted trend (borrowing strength across genes).
  3. Generalized linear model fitting — fits a negative binomial GLM per gene using the design formula.
  4. Wald test or LRT — tests for significance of coefficients (fold changes).
  5. Independent filtering & BH adjustment — filters low-count genes, then adjusts p-values for multiple testing using Benjamini-Hochberg.

0 Prerequisites & Installation

You need R ≥ 4.3 and optionally RStudio. DESeq2 is installed from Bioconductor, not CRAN.

R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

# Additional packages used in this tutorial
install.packages(c("tidyverse", "pheatmap", "RColorBrewer", "ggrepel"))
BiocManager::install("EnhancedVolcano")
Byte tip
Pro Tip: Use BiocManager::version() to verify your Bioconductor release matches your R version. Mismatches cause cryptic installation failures. Check the Bioconductor install page for the compatibility matrix.

1 Load Your Count Matrix

Your input is a genes × samples matrix of raw integer counts. This typically comes from a read aligner (STAR, HISAT2) followed by a counter (featureCounts, HTSeq). Many modern pipelines like nf-core/rnaseq produce this matrix automatically.

R
library(DESeq2)
library(tidyverse)

# Read raw count matrix (rows = genes, columns = samples)
count_matrix <- read.delim("counts.txt", row.names = 1) %>%
  as.matrix()

# Quick sanity checks
dim(count_matrix)          # Expect: num_genes x num_samples
head(count_matrix[, 1:6])  # Peek at first rows/columns
stopifnot(is.integer(count_matrix[1,1]) || all(count_matrix == floor(count_matrix)))

# Build sample metadata (coldata)
# Rows of coldata must match columns of count_matrix in the same order
sample_info <- data.frame(
  row.names = colnames(count_matrix),
  condition = factor(c(rep("Control", 3), rep("Treated", 3))),
  batch     = factor(c("A","A","B","A","B","B"))
)

# CRITICAL: set the reference level for your comparison
sample_info$condition <- relevel(sample_info$condition, ref = "Control")
print(sample_info)
Byte warning
Common Pitfall: If your count matrix has decimal values, something went wrong upstream — either you loaded TPM/FPKM data, or a tool performed partial normalization. Go back and get the raw integer counts. DESeq2 will throw a warning or silently round, but your results will not be trustworthy.
Pre-filtering (Optional but Recommended)

Removing genes with very few reads across all samples speeds up computation and reduces the severity of the multiple testing correction.

R
# Keep genes where at least 3 samples have 10+ counts
keep <- rowSums(count_matrix >= 10) >= 3
count_matrix <- count_matrix[keep, ]
cat("Genes retained after pre-filtering:", nrow(count_matrix), "\n")
How to Choose a Pre-Filter Cutoff
  • Rule of thumb: require at least n samples (size of your smallest group) to have ≥ 10 counts.
  • Why 10? Genes below this contribute almost no statistical information — DESeq2's independent filtering handles them anyway, but pre-filtering reduces the object size and speeds up DESeq().
  • Stricter filtering (e.g., ≥ 20 counts in 50% of samples) is fine for discovery-focused analyses. More lenient filtering (≥ 5 counts) is appropriate for rare transcripts.

2 Build the DESeqDataSet Object

The DESeqDataSet bundles your counts, sample metadata, and design formula into one object.

R
dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData   = sample_info,
  design    = ~ batch + condition   # batch as covariate, condition as variable of interest
)
dds
ParameterWhat It DoesHow to Pick
countDataRaw integer count matrixGenes × Samples. Must be integers.
colDataSample metadata data.frameRow names must exactly match column names of countData.
designFormula describing the experimental designPut confounders (batch) first, variable of interest (condition) last. The last variable is what results() tests by default.
Byte key point
Design Formula Tips:
  • ~ condition — simplest two-group comparison
  • ~ batch + condition — adjust for known batch effects
  • ~ genotype + treatment + genotype:treatment — interaction model (e.g., does treatment effect differ by genotype?)
  • If you have paired samples (same patient, before/after), use ~ patient + condition

3 Run the DESeq2 Pipeline

A single function call runs the entire statistical pipeline: size factor estimation, dispersion estimation, and model fitting.

R
dds <- DESeq(dds)

# Inspect size factors (normalization)
sizeFactors(dds)
Byte waiting
Grab a coffee! For large datasets (>30,000 genes, >50 samples), DESeq() may take 5–15 minutes. If you need speed, pass parallel = TRUE and register a BiocParallel backend first.
Understanding Size Factors

Each sample receives a size factor that corrects for sequencing depth. A sample with size factor 1.2 had ~20% more sequencing depth than the "average" sample. Extreme size factors (e.g., 0.3 or 3.0) may indicate a problematic library.

4 Explore Dispersion Estimates

The dispersion plot is your first quality checkpoint. It shows how biological variability relates to mean expression.

R
plotDispEsts(dds, main = "Dispersion Estimates")
Byte explains
Reading the Dispersion Plot:
  • Black dots = per-gene maximum likelihood estimates (noisy)
  • Red line = fitted trend (the model's expectation)
  • Blue dots = shrunken (final) estimates pulled toward the trend
  • Healthy sign: cloud of black dots around the red line that decreases with higher mean expression
  • Red flag: trend line going up, massive scatter, or all dispersions clustering unrealistically low — indicates batch effects or sample mislabeling

5 Extract & Shrink Results

R
# Extract results for the condition comparison
res <- results(dds,
               contrast = c("condition", "Treated", "Control"),
               alpha    = 0.05)      # FDR threshold for independent filtering

# Log2 fold change shrinkage (recommended for ranking & visualization)
res_shrunk <- lfcShrink(dds,
                         coef  = "condition_Treated_vs_Control",
                         type  = "apeglm")   # apeglm is the recommended shrinkage method

summary(res_shrunk)
ParameterPurposeRecommendation
contrastSpecifies which comparison to extractc("variable", "numerator", "denominator"). Positive LFC = higher in numerator.
alphaFDR cutoff used by independent filteringStandard: 0.05. Use 0.1 for exploratory; 0.01 for high-confidence.
lfcShrink typeShrinkage estimator"apeglm" (default recommendation). "ashr" allows zero-centered priors. "normal" is the legacy method.
Why Shrink Fold Changes?

Genes with very low counts can show extreme fold changes (e.g., 1 read in control, 3 reads in treated = log2FC of 1.58). These are unreliable. LFC shrinkage pulls noisy estimates toward zero, giving more trustworthy rankings. Always use shrunken LFCs for ranking and visualization; use un-shrunken results for significance testing.

R
# Convert to tidy data frame and save
res_df <- as.data.frame(res_shrunk) %>%
  rownames_to_column("gene") %>%
  arrange(padj)

write.csv(res_df, "deseq2_results.csv", row.names = FALSE)
head(res_df)

How to Choose Significance Thresholds

This is one of the most misunderstood aspects of DE analysis. There is no universal correct cutoff — it depends on your experimental goals.

Threshold Decision Framework
Discovery / Exploratory
  • padj < 0.1
  • |log2FC| > 0.5
  • Casts a wide net
  • More false positives, fewer missed genes
  • Good for pathway/enrichment input
Standard Publication
  • padj < 0.05
  • |log2FC| > 1.0
  • Balanced approach
  • Most commonly used in literature
  • Good default if unsure
High-Confidence / Validation
  • padj < 0.01
  • |log2FC| > 2.0
  • Very stringent
  • Minimal false positives
  • Good for qPCR validation targets
Byte key point
Important Nuance: The padj cutoff controls the false discovery rate — at padj < 0.05, you expect ~5% of your "significant" genes to be false positives. The log2FC cutoff is about biological relevance — statistically significant but tiny changes (e.g., log2FC = 0.1) may not matter biologically. Always apply both thresholds together.

6 Visualize Your Results

PCA Plot (Quality Check)

PCA reveals global patterns and outliers. Samples in the same condition should cluster together.

R
# Variance-stabilizing transformation (for visualization only, not DE testing)
vsd <- vst(dds, blind = FALSE)

# PCA colored by condition
plotPCA(vsd, intgroup = "condition") +
  theme_minimal(base_size = 14) +
  ggtitle("PCA of Variance-Stabilized Counts")
Byte warns
Troubleshooting PCA: If samples don't cluster by your condition of interest but do cluster by batch, sequencing lane, or extraction date — you have a batch effect. Add that variable to your design formula (~ batch + condition). If samples are total outliers, consider removing them and re-running.
Volcano Plot

The most informative single plot for DE results: statistical significance (y-axis) vs. effect size (x-axis).

R
library(EnhancedVolcano)

EnhancedVolcano(res_shrunk,
  lab         = rownames(res_shrunk),
  x           = 'log2FoldChange',
  y           = 'padj',
  title       = 'Treated vs Control',
  subtitle    = 'DESeq2 with apeglm shrinkage',
  pCutoff     = 0.05,
  FCcutoff    = 1.0,
  pointSize   = 2.0,
  labSize     = 3.5,
  legendPosition = 'right',
  drawConnectors = TRUE,
  widthConnectors = 0.5)
MA Plot
R
plotMA(res_shrunk, ylim = c(-4, 4), main = "MA Plot (apeglm shrunken LFC)")
Sample-to-Sample Heatmap
R
library(pheatmap)

# Euclidean distances between samples
sample_dists <- dist(t(assay(vsd)))
pheatmap(as.matrix(sample_dists),
         clustering_distance_rows = sample_dists,
         clustering_distance_cols = sample_dists,
         color = colorRampPalette(c("navy","white","firebrick3"))(50),
         main = "Sample Distance Heatmap")
Top DE Genes Heatmap
R
# Top 30 DE genes by adjusted p-value
top_genes <- head(res_df$gene, 30)
mat <- assay(vsd)[top_genes, ]
mat <- mat - rowMeans(mat)   # Center by row mean

pheatmap(mat,
         scale           = "none",
         cluster_rows    = TRUE,
         cluster_cols    = TRUE,
         show_rownames   = TRUE,
         annotation_col  = sample_info,
         color           = colorRampPalette(c("#2166AC","white","#B2182B"))(100),
         main            = "Top 30 Differentially Expressed Genes")

7 Downstream: Functional Enrichment

Once you have your DE gene list, the next step is understanding what biological processes or pathways are enriched.

R
# Using clusterProfiler for GO enrichment
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")  # Change for your organism

library(clusterProfiler)
library(org.Hs.eg.db)

# Get significant upregulated genes
sig_genes <- res_df %>%
  filter(padj < 0.05, log2FoldChange > 1) %>%
  pull(gene)

# GO Biological Process enrichment
ego <- enrichGO(gene     = sig_genes,
                OrgDb    = org.Hs.eg.db,
                keyType  = "SYMBOL",
                ont      = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05)

dotplot(ego, showCategory = 15, title = "GO Biological Process Enrichment")
Byte tip
Beyond GO: Consider running GSEA (Gene Set Enrichment Analysis) on the full ranked gene list rather than just the significant genes. GSEA can detect coordinated pathway-level changes even when individual genes don't pass the significance cutoff. Use fgsea or clusterProfiler::gseGO().

DESeq2 vs. edgeR vs. limma-voom

All three are well-validated and widely published. Here's when to prefer each:

ToolStrengthsBest When
DESeq2 Conservative, built-in shrinkage, great documentation Small sample sizes (3–5 per group); default choice
edgeR Fast, flexible, quasi-likelihood F-test Complex designs; larger sample sizes; speed matters
limma-voom Extremely fast, handles large designs elegantly Very large datasets (hundreds of samples); microarray background
Byte explains networks
Practical Advice: For a typical RNA-Seq experiment with 3–6 replicates per group, DESeq2 and edgeR give very similar results. If you're publishing, running both and reporting the overlap (intersection) of DE genes is a strong validation strategy that reviewers appreciate.

Summary

Byte
What You Learned:
  1. Load raw counts and build a properly designed DESeqDataSet
  2. Run the full DESeq2 pipeline in a single function call
  3. Interpret dispersion plots as a quality checkpoint
  4. Extract results with appropriate shrinkage (apeglm)
  5. Choose padj and log2FC thresholds based on your goals
  6. Create PCA, volcano, MA, and heatmap visualizations
  7. Perform functional enrichment on your DE gene list
  8. Compare DESeq2 with edgeR and limma-voom