WGCNA Gene Co-expression Network Analysis

Construct weighted gene correlation networks from the Maize Ligule Development RNA-seq dataset — soft-power selection, module detection, trait correlation, hub genes, and network export to Cytoscape.

~90 min Maize Ligule (Johnston et al. 2014) Intermediate R
Byte

Overview

Weighted Gene Co-expression Network Analysis (WGCNA) groups genes by correlated expression across samples. Each group (module) represents genes potentially co-regulated or functionally related. WGCNA then correlates module expression levels with external traits to identify which gene programs drive biological outcomes.

Byte
WGCNA key concepts:
  • Soft thresholding power (β): transforms correlation to a weighted adjacency — raises pairwise correlations to power β so the resulting network approximates scale-free topology
  • Module: cluster of highly correlated genes (eigengene = first principal component of the module)
  • Module eigengene (ME): the representative expression profile for a module — used to correlate modules with traits
  • Hub gene: gene with highest connectivity (kME) within its module — most likely to be a key regulatory gene

Tutorial authors: Jennifer Chang & Siva Chudalayandi (Iowa State University GIF). Original WGCNA package: Langfelder & Horvath (2008).

Ideal Directory Structure

WGCNA generates large intermediate R objects (TOM matrices). Save key .RData checkpoints at each major step so you can restart from any point without rerunning the entire pipeline.

wgcna_maize_ligule/
wgcna_maize_ligule/ ├── input/ │ ├── wgcna_input_data.rds # pre-normalized count matrix │ └── maize_ligule_traits.csv # sample metadata / trait values ├── 1_qc/ │ └── sample_clustering_outlier_check.png ├── 2_soft_power/ │ └── soft_power_selection.png ├── 3_network/ │ ├── WGCNA_network.RData # net, moduleColors, MEs, datExpr │ └── gene_dendrogram_modules.png ├── 4_trait_correlation/ │ └── module_trait_heatmap.png ├── 5_hub_genes/ │ └── hub_genes_per_module.csv └── 6_cytoscape/ └── network_edges.txt # for Cytoscape import
💾
Memory tip: The Topological Overlap Matrix (TOM) for 27,000 genes is a 27,000 x 27,000 matrix = ~5.8 GB RAM. Use maxBlockSize to split into blocks if your machine has less than 16 GB RAM. Setting saveTOMs=TRUE writes TOM to disk (~3 GB per block) so you can re-cluster without recomputing.

Dataset

Maize (Zea mays) ligule development RNA-seq from Johnston et al. (2014, The Plant Cell). The ligule is a fringe of cells at the blade-sheath boundary. This dataset profiles gene expression across different developmental stages and tissues including leaf blade, ligule, sheath, auricle, and P6 primordia.

PropertyValue
OrganismZea mays (Maize)
TissuesBlade, Ligule, Sheath, Auricle, P6 primordium
Replicates3 per tissue/stage
Data filewgcna_input_data.rds (pre-normalized counts)
Gene count~27,000 maize genes after filtering

Step 1 — Install WGCNA and Load Data

R — Install and load packages
# Install WGCNA and dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("WGCNA")
install.packages(c("ggplot2", "tidyverse", "pheatmap", "viridis"))

library(WGCNA)
library(ggplot2)
library(tidyverse)
library(pheatmap)

# Allow multi-threading in WGCNA (speeds up significantly on multi-core machines)
enableWGCNAThreads()   # uses all available cores

# Load the pre-processed count matrix
# Rows = genes, Columns = samples
# Download: wget https://bioinformaticsworkbook.org/tutorials/wgcna_input_data.rds
datExpr0 <- readRDS("wgcna_input_data.rds")

# Data should be: rows = samples, columns = genes
# WGCNA expects: datExpr[samples, genes]
# If your data is genes x samples, TRANSPOSE it:
if (ncol(datExpr0) < nrow(datExpr0)) {
    datExpr0 <- t(datExpr0)
}

cat("Samples:", nrow(datExpr0), "\n")
cat("Genes:", ncol(datExpr0), "\n")
cat("Dimensions:", dim(datExpr0), "\n")
Download:

Step 2 — Data QC and Gene Filtering

R — Remove bad samples and low-variance genes
library(WGCNA)

# Check for genes/samples with too many missing values
gsg <- goodSamplesGenes(datExpr0, verbose = 3)
gsg$allOK   # TRUE = all samples and genes pass

if (!gsg$allOK) {
    # Remove flagged genes and samples
    if (sum(!gsg$goodGenes) > 0)
        printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse=", ")))
    if (sum(!gsg$goodSamples) > 0)
        printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse=", ")))
    datExpr0 <- datExpr0[gsg$goodSamples, gsg$goodGenes]
}

# Cluster samples to check for outliers
sampleTree <- hclust(dist(datExpr0), method = "average")
png("sample_clustering_outlier_check.png", width=1200, height=600)
plot(sampleTree, main="Sample clustering to detect outliers",
     sub="", xlab="", cex.lab=1.5, cex.axis=1.5, cex.main=2)
abline(h=15, col="red")  # adjust threshold based on the plot
dev.off()

# Optionally remove outlier samples (those above the cutoff line)
clust <- cutreeStatic(sampleTree, cutHeight=15, minSize=10)
keepSamples <- (clust == 1)
datExpr <- datExpr0[keepSamples, ]

cat("Retained samples:", nrow(datExpr), "\n")
cat("Retained genes:", ncol(datExpr), "\n")
Download:
WGCNA QC functions decoded
goodSamplesGenes()Checks for genes/samples with excessive missing values or zero variance. Both can crash WGCNA's correlation calculations. Returns $goodGenes and $goodSamples logical vectors. Always run this first even if your data looks clean.
📈hclust(dist(datExpr))Hierarchical clustering of samples based on Euclidean distance of gene expression profiles. Outlier samples appear as long branches far from the main cluster. Removing them before network construction prevents distorted modules.
cutreeStatic(sampleTree, cutHeight=15)Cuts the sample dendrogram at a fixed height to identify clusters. Samples in cluster 1 (the main group) are kept; isolated outlier samples (in other clusters) are removed. Adjust cutHeight based on your dendrogram plot.

Step 3 — Soft-Power Selection

WGCNA raises all pairwise gene correlations to a power β to create a weighted network. The correct β is chosen so the network approximates scale-free topology (a hallmark of real biological networks):

R — Soft-power threshold selection
library(WGCNA)

# Test soft-thresholding powers from 1 to 30
powers <- c(1:10, seq(12, 30, by=2))

# pickSoftThreshold tests each power and measures how well the
# resulting network fits a scale-free topology
sft <- pickSoftThreshold(
    datExpr,
    powerVector = powers,
    verbose = 5,
    networkType = "signed"   # "signed" preserves direction of correlation
)

# Plot the results
png("soft_power_selection.png", width=1200, height=600)
par(mfrow=c(1,2))

# Scale-free topology fit index (R^2)
# Choose the LOWEST power where the curve flattens above R^2 = 0.80
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)", ylab="Scale Free Topology Model Fit R^2",
     main="Scale independence", type="n")
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers, col="red")
abline(h=0.80, col="red", lty=2)

# Mean connectivity
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)", ylab="Mean Connectivity",
     main="Mean connectivity", type="n")
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, col="red")
dev.off()

# Choose the soft power
# Typically: the lowest power where R^2 >= 0.80
# For this dataset: softPower = 14 (signed network, plant data)
softPower <- sft$powerEstimate
cat("Recommended soft power:", softPower, "\n")
# If powerEstimate is NA, choose manually from the plot
Download:
Soft-power selection decoded
🔨pickSoftThreshold()Tests each candidate power and computes (1) R² of the scale-free topology fit and (2) mean network connectivity. Returns $powerEstimate = the recommended power. If it returns NA, choose the lowest power where R² ≥ 0.80 manually from the plot.
🖉networkType = "signed"Signed networks treat positively and negatively correlated gene pairs differently. The adjacency is (0.5 + 0.5 * cor)^power, ranging 0–1. Unsigned uses |cor|^power. For biological interpretation, signed is almost always better: co-expressed vs anti-correlated genes belong in different modules.
📈abline(h=0.80)The R² = 0.80 threshold is the standard cutoff for “good enough” scale-free topology. Higher powers give better fit but lower connectivity. The goal is to find the power where the network transitions from random to scale-free behavior.
Signed vs unsigned networks: Use signed for RNA-seq data — it distinguishes positively and negatively correlated genes (they should be in different modules). Unsigned networks merge positively and negatively correlated genes into the same module, which is biologically misleading. Signed networks typically require a higher soft power (12–20 vs 6–12 for unsigned).

Step 4 — Build Network and Detect Modules

R — blockwiseModules (memory-efficient network construction)
library(WGCNA)

softPower <- 14   # from Step 3

# Build the correlation network and detect modules in one step
# blockwiseModules processes data in blocks to handle large gene sets
net <- blockwiseModules(
    datExpr,
    power = softPower,
    networkType = "signed",
    TOMType = "signed",         # Topological Overlap Matrix type
    minModuleSize = 30,         # minimum genes per module
    reassignThreshold = 0,
    mergeCutHeight = 0.25,      # merge modules with >75% similar eigengenes
    numericLabels = TRUE,
    pamRespectsDendro = FALSE,
    saveTOMs = FALSE,
    verbose = 3,
    maxBlockSize = 15000        # adjust based on available RAM
)

# Module summary
table(net$colors)
# 0 = "grey" = unassigned genes
# 1 = largest module, 2 = second largest, etc.

cat("Number of modules detected:", max(net$colors), "\n")
cat("Unassigned genes (grey module):", sum(net$colors == 0), "\n")

# Convert numeric labels to color labels for plotting
mergedColors <- labels2colors(net$colors)

# Plot the dendrogram with module color assignment
png("gene_dendrogram_modules.png", width=1400, height=800)
plotDendroAndColors(
    net$dendrograms[[1]],
    mergedColors[net$blockGenes[[1]]],
    "Module colors",
    dendroLabels = FALSE,
    hang = 0.03,
    addGuide = TRUE,
    guideHang = 0.05,
    main = "Gene dendrogram and module colors"
)
dev.off()

# Save module membership for all genes
moduleColors <- mergedColors
moduleLabels <- net$colors
MEs <- net$MEs   # Module Eigengenes (one per sample per module)

save(net, moduleColors, moduleLabels, MEs, datExpr,
     file="WGCNA_network.RData")
Download:
blockwiseModules parameters decoded
🧰TOMType = "signed"The Topological Overlap Matrix (TOM) measures shared network neighbors between gene pairs, not just pairwise correlation. Signed TOM preserves the direction of correlation when building this overlap measure. More biologically informative than unsigned TOM.
📈minModuleSize = 30Minimum number of genes required to form a module. Smaller values = more, smaller modules. Larger values = fewer, larger modules. 30 is a commonly used default for RNA-seq. Increase to 50–100 for very large datasets to avoid spurious small modules.
mergeCutHeight = 0.25Modules with eigengene correlation ≥ (1 − 0.25) = 0.75 are merged after initial detection. This reduces over-splitting of highly similar modules. Decrease to merge more aggressively; increase to keep more distinct modules.
💾maxBlockSize = 15000Maximum genes to process in one block. If you have more genes, WGCNA splits them into multiple blocks processed sequentially. Reduce if you get out-of-memory errors. Each block requires approximately maxBlockSize² × 8 bytes of RAM.
🌈
What is the grey module? Module 0 (grey) is where WGCNA puts genes it couldn’t assign to any real module. These are typically lowly expressed, noisy, or simply not co-expressed with other genes. A large grey module (>30% of genes) suggests your soft power is too high or your data needs more filtering.

Step 5 — Module–Trait Correlation

R — Correlate module eigengenes with traits
library(WGCNA)
load("WGCNA_network.RData")

# Load trait data — numeric matrix (samples x traits)
# Traits can be: tissue type (dummy-coded), developmental stage, etc.
# For maize ligule: tissue types encoded as 0/1
traitData <- read.csv("maize_ligule_traits.csv", row.names=1)
# Match sample order with expression data
traitData <- traitData[match(rownames(datExpr), rownames(traitData)), ]

# Recalculate MEs with proper ordering
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs <- orderMEs(MEs0)

# Calculate Pearson correlation between module eigengenes and traits
moduleTraitCor <- cor(MEs, traitData, use="p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nrow(datExpr))

# Visualize as heatmap with correlation + significance
png("module_trait_heatmap.png", width=1200, height=1000)
textMatrix <- paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep="")
dim(textMatrix) <- dim(moduleTraitCor)

par(mar = c(6, 8.5, 3, 3))
labeledHeatmap(
    Matrix = moduleTraitCor,
    xLabels = names(traitData),
    yLabels = names(MEs),
    ySymbols = names(MEs),
    colorLabels = FALSE,
    colors = blueWhiteRed(50),
    textMatrix = textMatrix,
    setStdMargins = FALSE,
    cex.text = 0.6,
    zlim = c(-1,1),
    main = "Module-Trait Relationships"
)
dev.off()

cat("Module-trait correlation heatmap saved.\n")
cat("Red = positive correlation, Blue = negative correlation\n")
cat("Numbers show Pearson r and p-value\n")
Download:
Module-trait correlation decoded
📈moduleEigengenes()Computes the eigengene for each module: the first principal component of all gene expression profiles within the module. The eigengene captures the dominant expression pattern of the module across samples. Treat it as a “summary gene” for the module.
📊cor(MEs, traitData)Pearson correlation between each module eigengene and each trait variable. Positive correlation = module is upregulated when the trait value is high. Negative = downregulated. Values close to ±1 with small p-values are biologically meaningful.
🌡blueWhiteRed(50)Color palette: blue = strong negative correlation, white = no correlation, red = strong positive correlation. Perfect for correlation heatmaps because the directionality is immediately intuitive.
🎯
Finding the most interesting module: Look for modules with the strongest correlation to your trait of interest AND the smallest p-value. Then use the module membership (kME) scores to find hub genes within that module. Hub genes are most likely to be key regulatory nodes driving the trait.

Step 6 — Identify Hub Genes

Hub genes have high module membership (kME) — they are the most strongly connected genes within their module and are the most likely candidates for key regulators:

R — Module membership (kME) and hub gene identification
library(WGCNA)
load("WGCNA_network.RData")

# Calculate gene-module membership (kME): correlation of each gene
# with each module eigengene
# kME[gene, module] = correlation of gene expression with module eigengene
geneModuleMembership <- as.data.frame(cor(datExpr, MEs, use="p"))
MMPvalue <- as.data.frame(corPvalueStudent(
    as.matrix(geneModuleMembership), nrow(datExpr)))

# Rename columns to match module names
names(geneModuleMembership) <- paste("MM", substring(names(MEs), 3), sep="")
names(MMPvalue) <- paste("p.MM", substring(names(MEs), 3), sep="")

# Build comprehensive gene info table
geneInfo <- data.frame(
    gene = colnames(datExpr),
    module = moduleColors
)

# Add kME for each module
geneInfo <- cbind(geneInfo, geneModuleMembership, MMPvalue)

# Find top hub genes per module (highest |kME| for own module)
hub_genes <- geneInfo %>%
    group_by(module) %>%
    arrange(desc(abs(!!sym(paste0("MM", module[1]))))) %>%
    slice_head(n=10) %>%
    ungroup()

cat("Top hub genes per module:\n")
print(hub_genes %>% select(gene, module) %>% head(30))

write.csv(geneInfo, "gene_module_membership.csv", row.names=FALSE)
write.csv(hub_genes, "hub_genes_per_module.csv", row.names=FALSE)

# Example: find hub genes in the module most correlated with "ligule" trait
# Identify the "ligule" module first from module-trait heatmap
# Then extract its top 50 hub genes:
TRAIT_MODULE <- "turquoise"   # replace with the module of interest
module_hubs <- geneInfo %>%
    filter(module == TRAIT_MODULE) %>%
    arrange(desc(abs(!!sym(paste0("MM", TRAIT_MODULE))))) %>%
    head(50)
cat("\nTop 50 hub genes in", TRAIT_MODULE, "module:\n")
print(module_hubs$gene)
Download:

Step 7 — Export to Cytoscape

R — Export network edge list for Cytoscape visualization
library(WGCNA)
load("WGCNA_network.RData")

# Choose module(s) to export
# Focus on the most biologically interesting module(s)
EXPORT_MODULE <- "turquoise"
module_genes <- colnames(datExpr)[moduleColors == EXPORT_MODULE]

cat("Genes in module:", length(module_genes), "\n")

# Compute Topological Overlap Matrix for these genes
# (adjacency between all pairs of genes in the module)
subExpr <- datExpr[, module_genes]
subTOM <- TOMsimilarityFromExpr(subExpr, power=softPower, networkType="signed")
dimnames(subTOM) <- list(module_genes, module_genes)

# Export edge list with weight > threshold (keeps network manageable)
WEIGHT_THRESHOLD <- 0.1

cyt_export <- exportNetworkToCytoscape(
    subTOM,
    edgeFile = paste0("Cytoscape_edges_", EXPORT_MODULE, ".txt"),
    nodeFile = paste0("Cytoscape_nodes_", EXPORT_MODULE, ".txt"),
    weighted = TRUE,
    threshold = WEIGHT_THRESHOLD,
    nodeNames = module_genes,
    altNodeNames = module_genes,
    nodeAttr = moduleColors[moduleColors == EXPORT_MODULE]
)

cat("Edges exported:", nrow(cyt_export$edgeData), "\n")
cat("Nodes exported:", nrow(cyt_export$nodeData), "\n")
cat("\nImport into Cytoscape:")
cat("File > Import > Network from File > Cytoscape_edges_", EXPORT_MODULE, ".txt\n")
cat("Then File > Import > Table from File > Cytoscape_nodes_", EXPORT_MODULE, ".txt\n")
Download:
Byte
WGCNA overview image: The pipeline takes a normalized count matrix, clusters genes by expression patterns, identifies modules (colored clusters), correlates module activity with traits, and extracts hub genes from trait-associated modules.

Troubleshooting

blockwiseModules: runs out of memory

Reduce maxBlockSize (e.g., to 5000 genes). The function processes genes in blocks — smaller blocks use less RAM but run slightly slower. Also ensure you have filtered out low-variance genes before running (typically keep top 25–50% most variable genes with apply(datExpr, 2, var)).

pickSoftThreshold returns NA or recommends power > 20

This usually means your dataset has too few samples (<15) or very high expression correlation across all genes (batch effect or highly similar samples). For <20 samples, use networkType = "unsigned" and lower powers (6–12). For batched data, correct with ComBat or limma's removeBatchEffect first.

All genes go into one giant module

Decrease mergeCutHeight (e.g., 0.1 instead of 0.25) to prevent aggressive merging. Also ensure your soft power is appropriate — too low a power creates a dense network where everything connects. Check the scale-free fit plot (R² should be >0.80 at your chosen power).