Construct weighted gene correlation networks from the Maize Ligule Development RNA-seq dataset — soft-power selection, module detection, trait correlation, hub genes, and network export to Cytoscape.
Weighted Gene Co-expression Network Analysis (WGCNA) groups genes by correlated expression across samples. Each group (module) represents genes potentially co-regulated or functionally related. WGCNA then correlates module expression levels with external traits to identify which gene programs drive biological outcomes.
Tutorial authors: Jennifer Chang & Siva Chudalayandi (Iowa State University GIF). Original WGCNA package: Langfelder & Horvath (2008).
WGCNA generates large intermediate R objects (TOM matrices). Save key .RData checkpoints at each major step so you can restart from any point without rerunning the entire pipeline.
maxBlockSize to split into blocks if your machine has less than 16 GB RAM. Setting saveTOMs=TRUE writes TOM to disk (~3 GB per block) so you can re-cluster without recomputing.Maize (Zea mays) ligule development RNA-seq from Johnston et al. (2014, The Plant Cell). The ligule is a fringe of cells at the blade-sheath boundary. This dataset profiles gene expression across different developmental stages and tissues including leaf blade, ligule, sheath, auricle, and P6 primordia.
| Property | Value |
|---|---|
| Organism | Zea mays (Maize) |
| Tissues | Blade, Ligule, Sheath, Auricle, P6 primordium |
| Replicates | 3 per tissue/stage |
| Data file | wgcna_input_data.rds (pre-normalized counts) |
| Gene count | ~27,000 maize genes after filtering |
# Install WGCNA and dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("WGCNA")
install.packages(c("ggplot2", "tidyverse", "pheatmap", "viridis"))
library(WGCNA)
library(ggplot2)
library(tidyverse)
library(pheatmap)
# Allow multi-threading in WGCNA (speeds up significantly on multi-core machines)
enableWGCNAThreads() # uses all available cores
# Load the pre-processed count matrix
# Rows = genes, Columns = samples
# Download: wget https://bioinformaticsworkbook.org/tutorials/wgcna_input_data.rds
datExpr0 <- readRDS("wgcna_input_data.rds")
# Data should be: rows = samples, columns = genes
# WGCNA expects: datExpr[samples, genes]
# If your data is genes x samples, TRANSPOSE it:
if (ncol(datExpr0) < nrow(datExpr0)) {
datExpr0 <- t(datExpr0)
}
cat("Samples:", nrow(datExpr0), "\n")
cat("Genes:", ncol(datExpr0), "\n")
cat("Dimensions:", dim(datExpr0), "\n")
library(WGCNA)
# Check for genes/samples with too many missing values
gsg <- goodSamplesGenes(datExpr0, verbose = 3)
gsg$allOK # TRUE = all samples and genes pass
if (!gsg$allOK) {
# Remove flagged genes and samples
if (sum(!gsg$goodGenes) > 0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse=", ")))
if (sum(!gsg$goodSamples) > 0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse=", ")))
datExpr0 <- datExpr0[gsg$goodSamples, gsg$goodGenes]
}
# Cluster samples to check for outliers
sampleTree <- hclust(dist(datExpr0), method = "average")
png("sample_clustering_outlier_check.png", width=1200, height=600)
plot(sampleTree, main="Sample clustering to detect outliers",
sub="", xlab="", cex.lab=1.5, cex.axis=1.5, cex.main=2)
abline(h=15, col="red") # adjust threshold based on the plot
dev.off()
# Optionally remove outlier samples (those above the cutoff line)
clust <- cutreeStatic(sampleTree, cutHeight=15, minSize=10)
keepSamples <- (clust == 1)
datExpr <- datExpr0[keepSamples, ]
cat("Retained samples:", nrow(datExpr), "\n")
cat("Retained genes:", ncol(datExpr), "\n")
| ✅goodSamplesGenes() | Checks for genes/samples with excessive missing values or zero variance. Both can crash WGCNA's correlation calculations. Returns $goodGenes and $goodSamples logical vectors. Always run this first even if your data looks clean. |
| 📈hclust(dist(datExpr)) | Hierarchical clustering of samples based on Euclidean distance of gene expression profiles. Outlier samples appear as long branches far from the main cluster. Removing them before network construction prevents distorted modules. |
| ✂cutreeStatic(sampleTree, cutHeight=15) | Cuts the sample dendrogram at a fixed height to identify clusters. Samples in cluster 1 (the main group) are kept; isolated outlier samples (in other clusters) are removed. Adjust cutHeight based on your dendrogram plot. |
WGCNA raises all pairwise gene correlations to a power β to create a weighted network. The correct β is chosen so the network approximates scale-free topology (a hallmark of real biological networks):
library(WGCNA)
# Test soft-thresholding powers from 1 to 30
powers <- c(1:10, seq(12, 30, by=2))
# pickSoftThreshold tests each power and measures how well the
# resulting network fits a scale-free topology
sft <- pickSoftThreshold(
datExpr,
powerVector = powers,
verbose = 5,
networkType = "signed" # "signed" preserves direction of correlation
)
# Plot the results
png("soft_power_selection.png", width=1200, height=600)
par(mfrow=c(1,2))
# Scale-free topology fit index (R^2)
# Choose the LOWEST power where the curve flattens above R^2 = 0.80
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)", ylab="Scale Free Topology Model Fit R^2",
main="Scale independence", type="n")
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers, col="red")
abline(h=0.80, col="red", lty=2)
# Mean connectivity
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)", ylab="Mean Connectivity",
main="Mean connectivity", type="n")
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, col="red")
dev.off()
# Choose the soft power
# Typically: the lowest power where R^2 >= 0.80
# For this dataset: softPower = 14 (signed network, plant data)
softPower <- sft$powerEstimate
cat("Recommended soft power:", softPower, "\n")
# If powerEstimate is NA, choose manually from the plot
| 🔨pickSoftThreshold() | Tests each candidate power and computes (1) R² of the scale-free topology fit and (2) mean network connectivity. Returns $powerEstimate = the recommended power. If it returns NA, choose the lowest power where R² ≥ 0.80 manually from the plot. |
| 🖉networkType = "signed" | Signed networks treat positively and negatively correlated gene pairs differently. The adjacency is (0.5 + 0.5 * cor)^power, ranging 0–1. Unsigned uses |cor|^power. For biological interpretation, signed is almost always better: co-expressed vs anti-correlated genes belong in different modules. |
| 📈abline(h=0.80) | The R² = 0.80 threshold is the standard cutoff for “good enough” scale-free topology. Higher powers give better fit but lower connectivity. The goal is to find the power where the network transitions from random to scale-free behavior. |
signed for RNA-seq data — it distinguishes positively and negatively correlated genes (they should be in different modules). Unsigned networks merge positively and negatively correlated genes into the same module, which is biologically misleading. Signed networks typically require a higher soft power (12–20 vs 6–12 for unsigned).library(WGCNA)
softPower <- 14 # from Step 3
# Build the correlation network and detect modules in one step
# blockwiseModules processes data in blocks to handle large gene sets
net <- blockwiseModules(
datExpr,
power = softPower,
networkType = "signed",
TOMType = "signed", # Topological Overlap Matrix type
minModuleSize = 30, # minimum genes per module
reassignThreshold = 0,
mergeCutHeight = 0.25, # merge modules with >75% similar eigengenes
numericLabels = TRUE,
pamRespectsDendro = FALSE,
saveTOMs = FALSE,
verbose = 3,
maxBlockSize = 15000 # adjust based on available RAM
)
# Module summary
table(net$colors)
# 0 = "grey" = unassigned genes
# 1 = largest module, 2 = second largest, etc.
cat("Number of modules detected:", max(net$colors), "\n")
cat("Unassigned genes (grey module):", sum(net$colors == 0), "\n")
# Convert numeric labels to color labels for plotting
mergedColors <- labels2colors(net$colors)
# Plot the dendrogram with module color assignment
png("gene_dendrogram_modules.png", width=1400, height=800)
plotDendroAndColors(
net$dendrograms[[1]],
mergedColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05,
main = "Gene dendrogram and module colors"
)
dev.off()
# Save module membership for all genes
moduleColors <- mergedColors
moduleLabels <- net$colors
MEs <- net$MEs # Module Eigengenes (one per sample per module)
save(net, moduleColors, moduleLabels, MEs, datExpr,
file="WGCNA_network.RData")
| 🧰TOMType = "signed" | The Topological Overlap Matrix (TOM) measures shared network neighbors between gene pairs, not just pairwise correlation. Signed TOM preserves the direction of correlation when building this overlap measure. More biologically informative than unsigned TOM. |
| 📈minModuleSize = 30 | Minimum number of genes required to form a module. Smaller values = more, smaller modules. Larger values = fewer, larger modules. 30 is a commonly used default for RNA-seq. Increase to 50–100 for very large datasets to avoid spurious small modules. |
| ✂mergeCutHeight = 0.25 | Modules with eigengene correlation ≥ (1 − 0.25) = 0.75 are merged after initial detection. This reduces over-splitting of highly similar modules. Decrease to merge more aggressively; increase to keep more distinct modules. |
| 💾maxBlockSize = 15000 | Maximum genes to process in one block. If you have more genes, WGCNA splits them into multiple blocks processed sequentially. Reduce if you get out-of-memory errors. Each block requires approximately maxBlockSize² × 8 bytes of RAM. |
library(WGCNA)
load("WGCNA_network.RData")
# Load trait data — numeric matrix (samples x traits)
# Traits can be: tissue type (dummy-coded), developmental stage, etc.
# For maize ligule: tissue types encoded as 0/1
traitData <- read.csv("maize_ligule_traits.csv", row.names=1)
# Match sample order with expression data
traitData <- traitData[match(rownames(datExpr), rownames(traitData)), ]
# Recalculate MEs with proper ordering
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)
# Calculate Pearson correlation between module eigengenes and traits
moduleTraitCor <- cor(MEs, traitData, use="p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nrow(datExpr))
# Visualize as heatmap with correlation + significance
png("module_trait_heatmap.png", width=1200, height=1000)
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep="")
dim(textMatrix) <- dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3))
labeledHeatmap(
Matrix = moduleTraitCor,
xLabels = names(traitData),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.6,
zlim = c(-1,1),
main = "Module-Trait Relationships"
)
dev.off()
cat("Module-trait correlation heatmap saved.\n")
cat("Red = positive correlation, Blue = negative correlation\n")
cat("Numbers show Pearson r and p-value\n")
| 📈moduleEigengenes() | Computes the eigengene for each module: the first principal component of all gene expression profiles within the module. The eigengene captures the dominant expression pattern of the module across samples. Treat it as a “summary gene” for the module. |
| 📊cor(MEs, traitData) | Pearson correlation between each module eigengene and each trait variable. Positive correlation = module is upregulated when the trait value is high. Negative = downregulated. Values close to ±1 with small p-values are biologically meaningful. |
| 🌡blueWhiteRed(50) | Color palette: blue = strong negative correlation, white = no correlation, red = strong positive correlation. Perfect for correlation heatmaps because the directionality is immediately intuitive. |
Hub genes have high module membership (kME) — they are the most strongly connected genes within their module and are the most likely candidates for key regulators:
library(WGCNA)
load("WGCNA_network.RData")
# Calculate gene-module membership (kME): correlation of each gene
# with each module eigengene
# kME[gene, module] = correlation of gene expression with module eigengene
geneModuleMembership <- as.data.frame(cor(datExpr, MEs, use="p"))
MMPvalue <- as.data.frame(corPvalueStudent(
as.matrix(geneModuleMembership), nrow(datExpr)))
# Rename columns to match module names
names(geneModuleMembership) <- paste("MM", substring(names(MEs), 3), sep="")
names(MMPvalue) <- paste("p.MM", substring(names(MEs), 3), sep="")
# Build comprehensive gene info table
geneInfo <- data.frame(
gene = colnames(datExpr),
module = moduleColors
)
# Add kME for each module
geneInfo <- cbind(geneInfo, geneModuleMembership, MMPvalue)
# Find top hub genes per module (highest |kME| for own module)
hub_genes <- geneInfo %>%
group_by(module) %>%
arrange(desc(abs(!!sym(paste0("MM", module[1]))))) %>%
slice_head(n=10) %>%
ungroup()
cat("Top hub genes per module:\n")
print(hub_genes %>% select(gene, module) %>% head(30))
write.csv(geneInfo, "gene_module_membership.csv", row.names=FALSE)
write.csv(hub_genes, "hub_genes_per_module.csv", row.names=FALSE)
# Example: find hub genes in the module most correlated with "ligule" trait
# Identify the "ligule" module first from module-trait heatmap
# Then extract its top 50 hub genes:
TRAIT_MODULE <- "turquoise" # replace with the module of interest
module_hubs <- geneInfo %>%
filter(module == TRAIT_MODULE) %>%
arrange(desc(abs(!!sym(paste0("MM", TRAIT_MODULE))))) %>%
head(50)
cat("\nTop 50 hub genes in", TRAIT_MODULE, "module:\n")
print(module_hubs$gene)
library(WGCNA)
load("WGCNA_network.RData")
# Choose module(s) to export
# Focus on the most biologically interesting module(s)
EXPORT_MODULE <- "turquoise"
module_genes <- colnames(datExpr)[moduleColors == EXPORT_MODULE]
cat("Genes in module:", length(module_genes), "\n")
# Compute Topological Overlap Matrix for these genes
# (adjacency between all pairs of genes in the module)
subExpr <- datExpr[, module_genes]
subTOM <- TOMsimilarityFromExpr(subExpr, power=softPower, networkType="signed")
dimnames(subTOM) <- list(module_genes, module_genes)
# Export edge list with weight > threshold (keeps network manageable)
WEIGHT_THRESHOLD <- 0.1
cyt_export <- exportNetworkToCytoscape(
subTOM,
edgeFile = paste0("Cytoscape_edges_", EXPORT_MODULE, ".txt"),
nodeFile = paste0("Cytoscape_nodes_", EXPORT_MODULE, ".txt"),
weighted = TRUE,
threshold = WEIGHT_THRESHOLD,
nodeNames = module_genes,
altNodeNames = module_genes,
nodeAttr = moduleColors[moduleColors == EXPORT_MODULE]
)
cat("Edges exported:", nrow(cyt_export$edgeData), "\n")
cat("Nodes exported:", nrow(cyt_export$nodeData), "\n")
cat("\nImport into Cytoscape:")
cat("File > Import > Network from File > Cytoscape_edges_", EXPORT_MODULE, ".txt\n")
cat("Then File > Import > Table from File > Cytoscape_nodes_", EXPORT_MODULE, ".txt\n")
Reduce maxBlockSize (e.g., to 5000 genes). The function processes genes in blocks — smaller blocks use less RAM but run slightly slower. Also ensure you have filtered out low-variance genes before running (typically keep top 25–50% most variable genes with apply(datExpr, 2, var)).
This usually means your dataset has too few samples (<15) or very high expression correlation across all genes (batch effect or highly similar samples). For <20 samples, use networkType = "unsigned" and lower powers (6–12). For batched data, correct with ComBat or limma's removeBatchEffect first.
Decrease mergeCutHeight (e.g., 0.1 instead of 0.25) to prevent aggressive merging. Also ensure your soft power is appropriate — too low a power creates a dense network where everything connects. Check the scale-free fit plot (R² should be >0.80 at your chosen power).