Discover modules of co-expressed genes, relate them to traits, identify hub genes, and export publication-ready networks — with every decision point explained.
Weighted Gene Correlation Network Analysis clusters genes into modules based on their co-expression patterns across samples. Instead of asking "which genes change between two groups?" (that's DE analysis), WGCNA asks: "which groups of genes behave similarly across all conditions, and do those groups relate to specific traits?"
install.packages(c("WGCNA", "tidyverse", "magrittr"))
# For normalization
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "genefilter"))
library(WGCNA)
library(tidyverse)
library(DESeq2)
WGCNA expects a samples × genes matrix of normalized expression values (not raw counts). We use DESeq2's variance-stabilizing transformation (VST) to normalize, then subset to the most variable genes for computational efficiency.
# Load your raw count data (genes x samples)
raw_counts <- read.delim("counts.txt", row.names = 1)
# Build a minimal DESeq2 object for normalization
sample_meta <- data.frame(
row.names = colnames(raw_counts),
group = factor(gsub("[_.-].*", "", colnames(raw_counts)))
)
dds <- DESeqDataSetFromMatrix(
countData = round(as.matrix(raw_counts)),
colData = sample_meta,
design = ~ group
)
dds <- DESeq(dds)
# Variance-Stabilizing Transformation
vsd <- getVarianceStabilizedData(dds)
# Subset to top variable genes
rv <- matrixStats::rowVars(vsd)
q_threshold <- quantile(rv, 0.75) # Keep top 25% most variable genes
expr_mat <- vsd[rv > q_threshold, ]
cat("Genes for WGCNA:", nrow(expr_mat), "\n")
# CRITICAL: Transpose so rows = samples, columns = genes
input_mat <- t(expr_mat)
dim(input_mat)
maxBlockSize tuning and sufficient RAM# Hierarchical clustering to detect outlier samples
sample_tree <- hclust(dist(input_mat), method = "average")
plot(sample_tree, main = "Sample Clustering to Detect Outliers",
sub = "", xlab = "", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)
# Draw a cut line — samples above this may be outliers
abline(h = 150, col = "red") # Adjust height based on your data
goodSamplesGenes() as an automated check.
This is the most critical decision in WGCNA. The soft power β determines how aggressively weak correlations are suppressed. A higher β creates a sparser, more modular network.
allowWGCNAThreads()
powers <- c(1:20)
sft <- pickSoftThreshold(input_mat, powerVector = powers, verbose = 5)
# Two-panel diagnostic plot
par(mfrow = c(1, 2))
# Panel 1: Scale-free topology fit index vs power
plot(sft$fitIndices[, 1],
-sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
xlab = "Soft Threshold (power)",
ylab = "Scale Free Topology Model Fit (signed R²)",
main = "Scale Independence",
type = "n")
text(sft$fitIndices[, 1],
-sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
labels = powers, col = "red")
abline(h = 0.85, col = "blue", lty = 2)
# Panel 2: Mean connectivity vs power
plot(sft$fitIndices[, 1], sft$fitIndices[, 5],
xlab = "Soft Threshold (power)",
ylab = "Mean Connectivity",
main = "Mean Connectivity",
type = "n")
text(sft$fitIndices[, 1], sft$fitIndices[, 5],
labels = powers, col = "red")
picked_power <- 9 # Replace with your chosen value # Fix namespace conflict between stats::cor and WGCNA::cor temp_cor <- cor cor <- WGCNA::cor netwk <- blockwiseModules( input_mat, power = picked_power, networkType = "signed", # --- Tree & clustering --- deepSplit = 2, # 0-4; higher = more smaller modules minModuleSize = 30, # minimum genes per module maxBlockSize = 20000, # genes per computational block # --- Module merging --- mergeCutHeight = 0.25, # modules with correlation > 0.75 are merged # --- Reassignment --- pamRespectsDendro = FALSE, reassignThreshold = 0, # --- Save TOM for later reuse --- saveTOMs = TRUE, saveTOMFileBase = "TOM", numericLabels = TRUE, verbose = 3 ) cor <- temp_cor # Restore default cor function table(netwk$colors) # Number of genes per module
maxBlockSize is smaller than your gene count, WGCNA splits genes into blocks and merges modules afterwards — this is faster but slightly less accurate.
| Parameter | What It Controls | How to Tune |
|---|---|---|
power | Soft-thresholding exponent | Use value from Step 2 diagnostic plot |
networkType | Signed, unsigned, or signed hybrid | "signed" for most biological applications |
deepSplit | Sensitivity of module splitting (0–4) | 2 = default. Increase to 3 or 4 for finer-grained modules. Decrease to 1 for fewer, larger modules. |
minModuleSize | Smallest allowed module | 30 is standard. Increase to 50–100 for cleaner modules; decrease to 15–20 if you expect small regulatory modules. |
mergeCutHeight | Dissimilarity threshold for merging similar modules | 0.25 = merge modules with eigengene correlation > 0.75. Lower value = more merging. |
maxBlockSize | Genes per computation block | Set to total gene count if RAM allows (16GB+ for ~15k genes). Otherwise, use 5000–10000. |
module_colors <- labels2colors(netwk$colors) plotDendroAndColors( netwk$dendrograms[[1]], module_colors[netwk$blockGenes[[1]]], "Module Colors", dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05, main = "Gene Dendrogram with Module Colors" )
# Create a gene → module mapping
module_df <- data.frame(
gene_id = names(netwk$colors),
module = labels2colors(netwk$colors),
stringsAsFactors = FALSE
)
# Count genes per module
module_df %>% count(module, sort = TRUE)
# Save for later use
write.table(module_df, "gene_module_assignments.tsv",
sep = "\t", row.names = FALSE, quote = FALSE)
The key biological question: which modules associate with which experimental conditions or phenotypes? WGCNA computes a module eigengene for each module and correlates it against your trait data.
# Calculate module eigengenes
MEs0 <- moduleEigengenes(input_mat, module_colors)$eigengenes
MEs <- orderMEs(MEs0)
# Create a numeric trait matrix (encode conditions as 0/1)
# Adapt this to your actual experimental groups
traits <- sample_meta
traits$group <- as.numeric(factor(traits$group)) - 1
# Correlate eigengenes with traits
module_trait_cor <- cor(MEs, traits, use = "p")
module_trait_pval <- corPvalueStudent(module_trait_cor, nrow(input_mat))
# Heatmap of module-trait relationships
textMatrix <- paste0(signif(module_trait_cor, 2),
"\n(", signif(module_trait_pval, 1), ")")
dim(textMatrix) <- dim(module_trait_cor)
labeledHeatmap(
Matrix = module_trait_cor,
xLabels = colnames(traits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.6,
zlim = c(-1, 1),
main = "Module-Trait Relationships"
)
modules_of_interest <- c("turquoise", "blue", "green")
sub_genes <- module_df %>% filter(module %in% modules_of_interest)
sub_expr <- expr_mat[sub_genes$gene_id, ]
# Tidy for ggplot
plot_df <- as.data.frame(sub_expr) %>%
mutate(gene_id = rownames(.)) %>%
pivot_longer(-gene_id, names_to = "sample", values_to = "expr") %>%
left_join(module_df, by = "gene_id")
ggplot(plot_df, aes(x = sample, y = expr, group = gene_id)) +
geom_line(aes(color = module), alpha = 0.15, linewidth = 0.3) +
facet_grid(rows = vars(module)) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7)) +
labs(title = "Expression Profiles by Module",
x = "Sample", y = "Normalized Expression") +
scale_color_identity()
Hub genes are the most highly connected genes within a module. They often play central regulatory roles and are prime candidates for experimental validation.
# Module membership (kME) = correlation of each gene with its module eigengene
kME <- signedKME(input_mat, MEs)
# For a specific module (e.g., turquoise):
target_module <- "turquoise"
target_genes <- module_df %>% filter(module == target_module) %>% pull(gene_id)
kme_col <- paste0("kME", target_module)
hub_df <- data.frame(
gene_id = target_genes,
kME = kME[target_genes, kme_col]
) %>%
arrange(desc(kME))
# Top 10 hub genes
head(hub_df, 10)
For interactive network visualization, export the module network to Cytoscape or igraph.
# Recalculate TOM for modules of interest
target_genes <- module_df %>%
filter(module %in% modules_of_interest) %>%
pull(gene_id)
target_expr <- expr_mat[target_genes, ]
TOM <- TOMsimilarityFromExpr(t(target_expr), power = picked_power)
rownames(TOM) <- rownames(target_expr)
colnames(TOM) <- rownames(target_expr)
# Build edge list
edge_list <- as.data.frame(TOM) %>%
mutate(gene1 = rownames(.)) %>%
pivot_longer(-gene1, names_to = "gene2", values_to = "weight") %>%
filter(gene1 != gene2, weight > 0.02) %>% # filter weak edges
left_join(module_df, by = c("gene1" = "gene_id")) %>%
rename(module1 = module) %>%
left_join(module_df, by = c("gene2" = "gene_id")) %>%
rename(module2 = module)
# Save for Cytoscape import
write.table(edge_list, "network_edge_list.tsv",
sep = "\t", row.names = FALSE, quote = FALSE)
cat("Edges exported:", nrow(edge_list), "\n")
WGCNA has many parameters that affect module detection. Here's a systematic approach to tuning:
| If You Want... | Adjust This | Direction |
|---|---|---|
| More modules (finer resolution) | deepSplit | Increase to 3 or 4 |
| Fewer modules (broader grouping) | mergeCutHeight | Increase (e.g., 0.4 merges more aggressively) |
| Remove tiny modules | minModuleSize | Increase to 50 or 100 |
| Tighter, more connected modules | power | Increase slightly (more stringent adjacency) |
| Faster computation | maxBlockSize | Decrease (genes split into smaller blocks, but slight accuracy trade-off) |
| Better biological interpretability | networkType | Use "signed" |
blockwiseModules