A from-scratch guide to identifying differentially expressed genes using DESeq2 in R — with every parameter explained and every threshold justified.
When you compare two biological conditions (e.g., disease vs. healthy, treated vs. untreated), some genes will be more active (upregulated) or less active (downregulated) in one condition versus the other. Differential expression (DE) analysis is the statistical framework that identifies these genes from RNA-Seq count data.
The core challenge: raw read counts are affected by sequencing depth, gene length, and biological variability. A robust DE tool must account for all three. DESeq2 handles this through a negative binomial generalized linear model with shrinkage estimators for dispersion and fold change.
You need R ≥ 4.3 and optionally RStudio. DESeq2 is installed from Bioconductor, not CRAN.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
# Additional packages used in this tutorial
install.packages(c("tidyverse", "pheatmap", "RColorBrewer", "ggrepel"))
BiocManager::install("EnhancedVolcano")
BiocManager::version() to verify your Bioconductor release matches your R version. Mismatches cause cryptic installation failures. Check the Bioconductor install page for the compatibility matrix.
Your input is a genes × samples matrix of raw integer counts. This typically comes from a read aligner (STAR, HISAT2) followed by a counter (featureCounts, HTSeq). Many modern pipelines like nf-core/rnaseq produce this matrix automatically.
library(DESeq2)
library(tidyverse)
# Read raw count matrix (rows = genes, columns = samples)
count_matrix <- read.delim("counts.txt", row.names = 1) %>%
as.matrix()
# Quick sanity checks
dim(count_matrix) # Expect: num_genes x num_samples
head(count_matrix[, 1:6]) # Peek at first rows/columns
stopifnot(is.integer(count_matrix[1,1]) || all(count_matrix == floor(count_matrix)))
# Build sample metadata (coldata)
# Rows of coldata must match columns of count_matrix in the same order
sample_info <- data.frame(
row.names = colnames(count_matrix),
condition = factor(c(rep("Control", 3), rep("Treated", 3))),
batch = factor(c("A","A","B","A","B","B"))
)
# CRITICAL: set the reference level for your comparison
sample_info$condition <- relevel(sample_info$condition, ref = "Control")
print(sample_info)
Removing genes with very few reads across all samples speeds up computation and reduces the severity of the multiple testing correction.
# Keep genes where at least 3 samples have 10+ counts
keep <- rowSums(count_matrix >= 10) >= 3
count_matrix <- count_matrix[keep, ]
cat("Genes retained after pre-filtering:", nrow(count_matrix), "\n")
DESeq().The DESeqDataSet bundles your counts, sample metadata, and design formula into one object.
dds <- DESeqDataSetFromMatrix( countData = count_matrix, colData = sample_info, design = ~ batch + condition # batch as covariate, condition as variable of interest ) dds
| Parameter | What It Does | How to Pick |
|---|---|---|
countData | Raw integer count matrix | Genes × Samples. Must be integers. |
colData | Sample metadata data.frame | Row names must exactly match column names of countData. |
design | Formula describing the experimental design | Put confounders (batch) first, variable of interest (condition) last. The last variable is what results() tests by default. |
~ condition — simplest two-group comparison~ batch + condition — adjust for known batch effects~ genotype + treatment + genotype:treatment — interaction model (e.g., does treatment effect differ by genotype?)~ patient + conditionA single function call runs the entire statistical pipeline: size factor estimation, dispersion estimation, and model fitting.
dds <- DESeq(dds) # Inspect size factors (normalization) sizeFactors(dds)
DESeq() may take 5–15 minutes. If you need speed, pass parallel = TRUE and register a BiocParallel backend first.
Each sample receives a size factor that corrects for sequencing depth. A sample with size factor 1.2 had ~20% more sequencing depth than the "average" sample. Extreme size factors (e.g., 0.3 or 3.0) may indicate a problematic library.
The dispersion plot is your first quality checkpoint. It shows how biological variability relates to mean expression.
plotDispEsts(dds, main = "Dispersion Estimates")
# Extract results for the condition comparison
res <- results(dds,
contrast = c("condition", "Treated", "Control"),
alpha = 0.05) # FDR threshold for independent filtering
# Log2 fold change shrinkage (recommended for ranking & visualization)
res_shrunk <- lfcShrink(dds,
coef = "condition_Treated_vs_Control",
type = "apeglm") # apeglm is the recommended shrinkage method
summary(res_shrunk)
| Parameter | Purpose | Recommendation |
|---|---|---|
contrast | Specifies which comparison to extract | c("variable", "numerator", "denominator"). Positive LFC = higher in numerator. |
alpha | FDR cutoff used by independent filtering | Standard: 0.05. Use 0.1 for exploratory; 0.01 for high-confidence. |
lfcShrink type | Shrinkage estimator | "apeglm" (default recommendation). "ashr" allows zero-centered priors. "normal" is the legacy method. |
Genes with very low counts can show extreme fold changes (e.g., 1 read in control, 3 reads in treated = log2FC of 1.58). These are unreliable. LFC shrinkage pulls noisy estimates toward zero, giving more trustworthy rankings. Always use shrunken LFCs for ranking and visualization; use un-shrunken results for significance testing.
# Convert to tidy data frame and save
res_df <- as.data.frame(res_shrunk) %>%
rownames_to_column("gene") %>%
arrange(padj)
write.csv(res_df, "deseq2_results.csv", row.names = FALSE)
head(res_df)
This is one of the most misunderstood aspects of DE analysis. There is no universal correct cutoff — it depends on your experimental goals.
padj < 0.1|log2FC| > 0.5padj < 0.05|log2FC| > 1.0padj < 0.01|log2FC| > 2.0
padj cutoff controls the false discovery rate — at padj < 0.05, you expect ~5% of your "significant" genes to be false positives. The log2FC cutoff is about biological relevance — statistically significant but tiny changes (e.g., log2FC = 0.1) may not matter biologically. Always apply both thresholds together.
PCA reveals global patterns and outliers. Samples in the same condition should cluster together.
# Variance-stabilizing transformation (for visualization only, not DE testing)
vsd <- vst(dds, blind = FALSE)
# PCA colored by condition
plotPCA(vsd, intgroup = "condition") +
theme_minimal(base_size = 14) +
ggtitle("PCA of Variance-Stabilized Counts")
~ batch + condition). If samples are total outliers, consider removing them and re-running.
The most informative single plot for DE results: statistical significance (y-axis) vs. effect size (x-axis).
library(EnhancedVolcano) EnhancedVolcano(res_shrunk, lab = rownames(res_shrunk), x = 'log2FoldChange', y = 'padj', title = 'Treated vs Control', subtitle = 'DESeq2 with apeglm shrinkage', pCutoff = 0.05, FCcutoff = 1.0, pointSize = 2.0, labSize = 3.5, legendPosition = 'right', drawConnectors = TRUE, widthConnectors = 0.5)
plotMA(res_shrunk, ylim = c(-4, 4), main = "MA Plot (apeglm shrunken LFC)")
library(pheatmap)
# Euclidean distances between samples
sample_dists <- dist(t(assay(vsd)))
pheatmap(as.matrix(sample_dists),
clustering_distance_rows = sample_dists,
clustering_distance_cols = sample_dists,
color = colorRampPalette(c("navy","white","firebrick3"))(50),
main = "Sample Distance Heatmap")
# Top 30 DE genes by adjusted p-value
top_genes <- head(res_df$gene, 30)
mat <- assay(vsd)[top_genes, ]
mat <- mat - rowMeans(mat) # Center by row mean
pheatmap(mat,
scale = "none",
cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = TRUE,
annotation_col = sample_info,
color = colorRampPalette(c("#2166AC","white","#B2182B"))(100),
main = "Top 30 Differentially Expressed Genes")
Once you have your DE gene list, the next step is understanding what biological processes or pathways are enriched.
# Using clusterProfiler for GO enrichment
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db") # Change for your organism
library(clusterProfiler)
library(org.Hs.eg.db)
# Get significant upregulated genes
sig_genes <- res_df %>%
filter(padj < 0.05, log2FoldChange > 1) %>%
pull(gene)
# GO Biological Process enrichment
ego <- enrichGO(gene = sig_genes,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05)
dotplot(ego, showCategory = 15, title = "GO Biological Process Enrichment")
fgsea or clusterProfiler::gseGO().
All three are well-validated and widely published. Here's when to prefer each:
| Tool | Strengths | Best When |
|---|---|---|
| DESeq2 | Conservative, built-in shrinkage, great documentation | Small sample sizes (3–5 per group); default choice |
| edgeR | Fast, flexible, quasi-likelihood F-test | Complex designs; larger sample sizes; speed matters |
| limma-voom | Extremely fast, handles large designs elegantly | Very large datasets (hundreds of samples); microarray background |