Gene Co-Expression Network Analysis with WGCNA

? What Is WGCNA?

Weighted Gene Correlation Network Analysis clusters genes into modules based on their co-expression patterns across samples. Instead of asking "which genes change between two groups?" (that's DE analysis), WGCNA asks: "which groups of genes behave similarly across all conditions, and do those groups relate to specific traits?"

Core Concepts

Adjacency matrix — pairwise correlations raised to a "soft power" to emphasize strong connections and suppress weak/noisy ones
Topological Overlap Matrix (TOM) — measures shared network neighbors, making the network more robust than raw correlations
Module — a cluster of tightly co-expressed genes (identified via hierarchical clustering of the TOM)
Module Eigengene (ME) — the first principal component of a module's expression; a single summary "super-gene" representing the module
Hub gene — the most connected gene within a module; often the best candidate for functional follow-up

Scale-Free Assumption: WGCNA assumes the gene network follows a scale-free topology — most genes have few connections, but a few "hub" genes have many. This mirrors real biological networks. The soft-threshold power selection step ensures your network approximates this property.

0 Prerequisites & Installation

R

install.packages(c("WGCNA", "tidyverse", "magrittr"))

# For normalization
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("DESeq2", "genefilter"))

library(WGCNA)
library(tidyverse)
library(DESeq2)

Sample Requirements: WGCNA needs at minimum 15 samples to produce reliable results. With fewer than 15, correlations are noisy and modules become unreliable. For best results, aim for 20+ samples spanning multiple conditions or time points.

1 Prepare & Normalize Input Data

WGCNA expects a samples × genes matrix of normalized expression values (not raw counts). We use DESeq2's variance-stabilizing transformation (VST) to normalize, then subset to the most variable genes for computational efficiency.

R

# Load your raw count data (genes x samples)
raw_counts <- read.delim("counts.txt", row.names = 1)

# Build a minimal DESeq2 object for normalization
sample_meta <- data.frame(
  row.names = colnames(raw_counts),
  group = factor(gsub("[_.-].*", "", colnames(raw_counts)))
)

dds <- DESeqDataSetFromMatrix(
  countData = round(as.matrix(raw_counts)),
  colData   = sample_meta,
  design    = ~ group
)
dds <- DESeq(dds)

# Variance-Stabilizing Transformation
vsd <- getVarianceStabilizedData(dds)

# Subset to top variable genes
rv <- matrixStats::rowVars(vsd)
q_threshold <- quantile(rv, 0.75)  # Keep top 25% most variable genes
expr_mat <- vsd[rv > q_threshold, ]
cat("Genes for WGCNA:", nrow(expr_mat), "\n")

# CRITICAL: Transpose so rows = samples, columns = genes
input_mat <- t(expr_mat)
dim(input_mat)

How Many Genes to Include?

Top 25% by variance — good default; balances computation with gene coverage (~5,000–10,000 genes typically)
Top 5% by variance — faster; focuses only on the most dynamic genes; may miss subtle modules
All expressed genes — thorough but slow; only feasible with maxBlockSize tuning and sufficient RAM
Only DE genes — controversial; may break the scale-free assumption since you're removing the "many weakly-connected" genes; use with caution

Check for Outlier Samples

R

# Hierarchical clustering to detect outlier samples
sample_tree <- hclust(dist(input_mat), method = "average")
plot(sample_tree, main = "Sample Clustering to Detect Outliers",
     sub = "", xlab = "", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

# Draw a cut line — samples above this may be outliers
abline(h = 150, col = "red")  # Adjust height based on your data

Outlier Removal: If a sample is clearly separated in the dendrogram (long branch to reach the rest), remove it. Even one outlier can distort module detection because WGCNA uses correlations, which are highly sensitive to outliers. Use goodSamplesGenes() as an automated check.

2 Choose the Soft-Thresholding Power

This is the most critical decision in WGCNA. The soft power β determines how aggressively weak correlations are suppressed. A higher β creates a sparser, more modular network.

R

allowWGCNAThreads()

powers <- c(1:20)
sft <- pickSoftThreshold(input_mat, powerVector = powers, verbose = 5)

# Two-panel diagnostic plot
par(mfrow = c(1, 2))

# Panel 1: Scale-free topology fit index vs power
plot(sft$fitIndices[, 1],
     -sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
     xlab = "Soft Threshold (power)",
     ylab = "Scale Free Topology Model Fit (signed R²)",
     main = "Scale Independence",
     type = "n")
text(sft$fitIndices[, 1],
     -sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
     labels = powers, col = "red")
abline(h = 0.85, col = "blue", lty = 2)

# Panel 2: Mean connectivity vs power
plot(sft$fitIndices[, 1], sft$fitIndices[, 5],
     xlab = "Soft Threshold (power)",
     ylab = "Mean Connectivity",
     main = "Mean Connectivity",
     type = "n")
text(sft$fitIndices[, 1], sft$fitIndices[, 5],
     labels = powers, col = "red")

How to Pick the Right Power

Rules

Pick the lowest power where the scale-free fit R² first exceeds 0.85 (the blue dashed line)
The signed R² should be negative slope — if it's positive, the network is not scale-free at that power
Check that mean connectivity is not too low (below ~50) at your chosen power — some connectivity is needed for meaningful modules

Typical Values

Signed network: power 6–12 (commonly 8–10)
Unsigned network: power 4–8 (commonly 6)
If R² never reaches 0.85 even at power 20, your data may have strong batch effects or too few samples

Signed vs. Unsigned Networks:

Signed (recommended): genes that are positively correlated cluster together; negatively correlated genes go to different modules. Better biological interpretability.
Unsigned: treats positive and negative correlations equally. Use only if you specifically want anti-correlated genes in the same module.

3 Build the Network & Detect Modules

R

picked_power <- 9  # Replace with your chosen value

# Fix namespace conflict between stats::cor and WGCNA::cor
temp_cor <- cor
cor <- WGCNA::cor

netwk <- blockwiseModules(
  input_mat,
  power                = picked_power,
  networkType          = "signed",

  # --- Tree & clustering ---
  deepSplit            = 2,          # 0-4; higher = more smaller modules
  minModuleSize        = 30,         # minimum genes per module
  maxBlockSize         = 20000,      # genes per computational block

  # --- Module merging ---
  mergeCutHeight       = 0.25,       # modules with correlation > 0.75 are merged

  # --- Reassignment ---
  pamRespectsDendro    = FALSE,
  reassignThreshold    = 0,

  # --- Save TOM for later reuse ---
  saveTOMs             = TRUE,
  saveTOMFileBase      = "TOM",

  numericLabels        = TRUE,
  verbose              = 3
)

cor <- temp_cor  # Restore default cor function

table(netwk$colors)  # Number of genes per module

This will take a while! For ~5,000 genes, expect 5–15 minutes. For 15,000+ genes, it can take an hour or more. The TOM calculation is the bottleneck. If maxBlockSize is smaller than your gene count, WGCNA splits genes into blocks and merges modules afterwards — this is faster but slightly less accurate.

Parameter	What It Controls	How to Tune
`power`	Soft-thresholding exponent	Use value from Step 2 diagnostic plot
`networkType`	Signed, unsigned, or signed hybrid	`"signed"` for most biological applications
`deepSplit`	Sensitivity of module splitting (0–4)	2 = default. Increase to 3 or 4 for finer-grained modules. Decrease to 1 for fewer, larger modules.
`minModuleSize`	Smallest allowed module	30 is standard. Increase to 50–100 for cleaner modules; decrease to 15–20 if you expect small regulatory modules.
`mergeCutHeight`	Dissimilarity threshold for merging similar modules	0.25 = merge modules with eigengene correlation > 0.75. Lower value = more merging.
`maxBlockSize`	Genes per computation block	Set to total gene count if RAM allows (16GB+ for ~15k genes). Otherwise, use 5000–10000.

Module Dendrogram

R

module_colors <- labels2colors(netwk$colors)

plotDendroAndColors(
  netwk$dendrograms[[1]],
  module_colors[netwk$blockGenes[[1]]],
  "Module Colors",
  dendroLabels = FALSE,
  hang = 0.03,
  addGuide = TRUE,
  guideHang = 0.05,
  main = "Gene Dendrogram with Module Colors"
)

4 Explore Module Composition

R

# Create a gene → module mapping
module_df <- data.frame(
  gene_id = names(netwk$colors),
  module  = labels2colors(netwk$colors),
  stringsAsFactors = FALSE
)

# Count genes per module
module_df %>% count(module, sort = TRUE)

# Save for later use
write.table(module_df, "gene_module_assignments.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)

5 Module–Trait Relationships

The key biological question: which modules associate with which experimental conditions or phenotypes? WGCNA computes a module eigengene for each module and correlates it against your trait data.

R

# Calculate module eigengenes
MEs0 <- moduleEigengenes(input_mat, module_colors)$eigengenes
MEs  <- orderMEs(MEs0)

# Create a numeric trait matrix (encode conditions as 0/1)
# Adapt this to your actual experimental groups
traits <- sample_meta
traits$group <- as.numeric(factor(traits$group)) - 1

# Correlate eigengenes with traits
module_trait_cor  <- cor(MEs, traits, use = "p")
module_trait_pval <- corPvalueStudent(module_trait_cor, nrow(input_mat))

# Heatmap of module-trait relationships
textMatrix <- paste0(signif(module_trait_cor, 2),
                      "\n(", signif(module_trait_pval, 1), ")")
dim(textMatrix) <- dim(module_trait_cor)

labeledHeatmap(
  Matrix       = module_trait_cor,
  xLabels      = colnames(traits),
  yLabels      = names(MEs),
  ySymbols     = names(MEs),
  colorLabels  = FALSE,
  colors       = blueWhiteRed(50),
  textMatrix   = textMatrix,
  setStdMargins = FALSE,
  cex.text     = 0.6,
  zlim         = c(-1, 1),
  main         = "Module-Trait Relationships"
)

Reading the Heatmap: Each cell shows the correlation and its p-value. Strong red = module expression is high when trait value is high. Strong blue = inverse relationship. Focus on modules with |correlation| > 0.5 and p-value < 0.05 for follow-up.

Expression Profiles of Key Modules

R

modules_of_interest <- c("turquoise", "blue", "green")

sub_genes <- module_df %>% filter(module %in% modules_of_interest)
sub_expr  <- expr_mat[sub_genes$gene_id, ]

# Tidy for ggplot
plot_df <- as.data.frame(sub_expr) %>%
  mutate(gene_id = rownames(.)) %>%
  pivot_longer(-gene_id, names_to = "sample", values_to = "expr") %>%
  left_join(module_df, by = "gene_id")

ggplot(plot_df, aes(x = sample, y = expr, group = gene_id)) +
  geom_line(aes(color = module), alpha = 0.15, linewidth = 0.3) +
  facet_grid(rows = vars(module)) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7)) +
  labs(title = "Expression Profiles by Module",
       x = "Sample", y = "Normalized Expression") +
  scale_color_identity()

6 Hub Gene Identification

Hub genes are the most highly connected genes within a module. They often play central regulatory roles and are prime candidates for experimental validation.

R

# Module membership (kME) = correlation of each gene with its module eigengene
kME <- signedKME(input_mat, MEs)

# For a specific module (e.g., turquoise):
target_module <- "turquoise"
target_genes  <- module_df %>% filter(module == target_module) %>% pull(gene_id)
kme_col       <- paste0("kME", target_module)

hub_df <- data.frame(
  gene_id = target_genes,
  kME     = kME[target_genes, kme_col]
) %>%
  arrange(desc(kME))

# Top 10 hub genes
head(hub_df, 10)

What Makes a Good Hub Gene?

High module membership (kME > 0.8) — strongly correlated with the module eigengene
High gene significance — also correlated with your trait of interest
Known biological function — bonus if the gene has a plausible role in the phenotype
The intersection of high kME and high gene significance gives you the strongest candidates

7 Export Network for Cytoscape

For interactive network visualization, export the module network to Cytoscape or igraph.

R

# Recalculate TOM for modules of interest
target_genes <- module_df %>%
  filter(module %in% modules_of_interest) %>%
  pull(gene_id)

target_expr <- expr_mat[target_genes, ]

TOM <- TOMsimilarityFromExpr(t(target_expr), power = picked_power)
rownames(TOM) <- rownames(target_expr)
colnames(TOM) <- rownames(target_expr)

# Build edge list
edge_list <- as.data.frame(TOM) %>%
  mutate(gene1 = rownames(.)) %>%
  pivot_longer(-gene1, names_to = "gene2", values_to = "weight") %>%
  filter(gene1 != gene2, weight > 0.02) %>%  # filter weak edges
  left_join(module_df, by = c("gene1" = "gene_id")) %>%
  rename(module1 = module) %>%
  left_join(module_df, by = c("gene2" = "gene_id")) %>%
  rename(module2 = module)

# Save for Cytoscape import
write.table(edge_list, "network_edge_list.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)
cat("Edges exported:", nrow(edge_list), "\n")

Cytoscape Import: In Cytoscape, use File → Import → Network from File. Set gene1 as Source, gene2 as Target, and weight as Edge Attribute. Then apply a force-directed layout. Color nodes by module and size by kME for a publication-ready figure.

Parameter Deep-Dive: Tuning Your Network

WGCNA has many parameters that affect module detection. Here's a systematic approach to tuning:

If You Want...	Adjust This	Direction
More modules (finer resolution)	`deepSplit`	Increase to 3 or 4
Fewer modules (broader grouping)	`mergeCutHeight`	Increase (e.g., 0.4 merges more aggressively)
Remove tiny modules	`minModuleSize`	Increase to 50 or 100
Tighter, more connected modules	`power`	Increase slightly (more stringent adjacency)
Faster computation	`maxBlockSize`	Decrease (genes split into smaller blocks, but slight accuracy trade-off)
Better biological interpretability	`networkType`	Use `"signed"`

Summary

What You Learned:

Prepare and normalize expression data for WGCNA input
Choose the soft-thresholding power using the scale-free fit diagnostic
Build the co-expression network and detect modules with blockwiseModules
Relate modules to experimental traits via eigengene correlation
Identify hub genes using module membership (kME)
Export networks for Cytoscape visualization
Tune every major parameter for your specific dataset

Prev: DESeq2 Next: Data Visualization

On this page