Gene Co-Expression Network Analysis with WGCNA

Discover modules of co-expressed genes, relate them to traits, identify hub genes, and export publication-ready networks — with every decision point explained.

~60 min Advanced R / RStudio
Byte with gene networks

? What Is WGCNA?

Weighted Gene Correlation Network Analysis clusters genes into modules based on their co-expression patterns across samples. Instead of asking "which genes change between two groups?" (that's DE analysis), WGCNA asks: "which groups of genes behave similarly across all conditions, and do those groups relate to specific traits?"

Core Concepts
  • Adjacency matrix — pairwise correlations raised to a "soft power" to emphasize strong connections and suppress weak/noisy ones
  • Topological Overlap Matrix (TOM) — measures shared network neighbors, making the network more robust than raw correlations
  • Module — a cluster of tightly co-expressed genes (identified via hierarchical clustering of the TOM)
  • Module Eigengene (ME) — the first principal component of a module's expression; a single summary "super-gene" representing the module
  • Hub gene — the most connected gene within a module; often the best candidate for functional follow-up
Byte key point
Scale-Free Assumption: WGCNA assumes the gene network follows a scale-free topology — most genes have few connections, but a few "hub" genes have many. This mirrors real biological networks. The soft-threshold power selection step ensures your network approximates this property.

0 Prerequisites & Installation

R
install.packages(c("WGCNA", "tidyverse", "magrittr"))

# For normalization
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("DESeq2", "genefilter"))

library(WGCNA)
library(tidyverse)
library(DESeq2)
Byte tip
Sample Requirements: WGCNA needs at minimum 15 samples to produce reliable results. With fewer than 15, correlations are noisy and modules become unreliable. For best results, aim for 20+ samples spanning multiple conditions or time points.

1 Prepare & Normalize Input Data

WGCNA expects a samples × genes matrix of normalized expression values (not raw counts). We use DESeq2's variance-stabilizing transformation (VST) to normalize, then subset to the most variable genes for computational efficiency.

R
# Load your raw count data (genes x samples)
raw_counts <- read.delim("counts.txt", row.names = 1)

# Build a minimal DESeq2 object for normalization
sample_meta <- data.frame(
  row.names = colnames(raw_counts),
  group = factor(gsub("[_.-].*", "", colnames(raw_counts)))
)

dds <- DESeqDataSetFromMatrix(
  countData = round(as.matrix(raw_counts)),
  colData   = sample_meta,
  design    = ~ group
)
dds <- DESeq(dds)

# Variance-Stabilizing Transformation
vsd <- getVarianceStabilizedData(dds)

# Subset to top variable genes
rv <- matrixStats::rowVars(vsd)
q_threshold <- quantile(rv, 0.75)  # Keep top 25% most variable genes
expr_mat <- vsd[rv > q_threshold, ]
cat("Genes for WGCNA:", nrow(expr_mat), "\n")

# CRITICAL: Transpose so rows = samples, columns = genes
input_mat <- t(expr_mat)
dim(input_mat)
How Many Genes to Include?
  • Top 25% by variance — good default; balances computation with gene coverage (~5,000–10,000 genes typically)
  • Top 5% by variance — faster; focuses only on the most dynamic genes; may miss subtle modules
  • All expressed genes — thorough but slow; only feasible with maxBlockSize tuning and sufficient RAM
  • Only DE genes — controversial; may break the scale-free assumption since you're removing the "many weakly-connected" genes; use with caution
Check for Outlier Samples
R
# Hierarchical clustering to detect outlier samples
sample_tree <- hclust(dist(input_mat), method = "average")
plot(sample_tree, main = "Sample Clustering to Detect Outliers",
     sub = "", xlab = "", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

# Draw a cut line — samples above this may be outliers
abline(h = 150, col = "red")  # Adjust height based on your data
Byte warning
Outlier Removal: If a sample is clearly separated in the dendrogram (long branch to reach the rest), remove it. Even one outlier can distort module detection because WGCNA uses correlations, which are highly sensitive to outliers. Use goodSamplesGenes() as an automated check.

2 Choose the Soft-Thresholding Power

This is the most critical decision in WGCNA. The soft power β determines how aggressively weak correlations are suppressed. A higher β creates a sparser, more modular network.

R
allowWGCNAThreads()

powers <- c(1:20)
sft <- pickSoftThreshold(input_mat, powerVector = powers, verbose = 5)

# Two-panel diagnostic plot
par(mfrow = c(1, 2))

# Panel 1: Scale-free topology fit index vs power
plot(sft$fitIndices[, 1],
     -sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
     xlab = "Soft Threshold (power)",
     ylab = "Scale Free Topology Model Fit (signed R²)",
     main = "Scale Independence",
     type = "n")
text(sft$fitIndices[, 1],
     -sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
     labels = powers, col = "red")
abline(h = 0.85, col = "blue", lty = 2)

# Panel 2: Mean connectivity vs power
plot(sft$fitIndices[, 1], sft$fitIndices[, 5],
     xlab = "Soft Threshold (power)",
     ylab = "Mean Connectivity",
     main = "Mean Connectivity",
     type = "n")
text(sft$fitIndices[, 1], sft$fitIndices[, 5],
     labels = powers, col = "red")
How to Pick the Right Power
Rules
  1. Pick the lowest power where the scale-free fit R² first exceeds 0.85 (the blue dashed line)
  2. The signed R² should be negative slope — if it's positive, the network is not scale-free at that power
  3. Check that mean connectivity is not too low (below ~50) at your chosen power — some connectivity is needed for meaningful modules
Typical Values
  • Signed network: power 6–12 (commonly 8–10)
  • Unsigned network: power 4–8 (commonly 6)
  • If R² never reaches 0.85 even at power 20, your data may have strong batch effects or too few samples
Byte key point
Signed vs. Unsigned Networks:
  • Signed (recommended): genes that are positively correlated cluster together; negatively correlated genes go to different modules. Better biological interpretability.
  • Unsigned: treats positive and negative correlations equally. Use only if you specifically want anti-correlated genes in the same module.

3 Build the Network & Detect Modules

R
picked_power <- 9  # Replace with your chosen value

# Fix namespace conflict between stats::cor and WGCNA::cor
temp_cor <- cor
cor <- WGCNA::cor

netwk <- blockwiseModules(
  input_mat,
  power                = picked_power,
  networkType          = "signed",

  # --- Tree & clustering ---
  deepSplit            = 2,          # 0-4; higher = more smaller modules
  minModuleSize        = 30,         # minimum genes per module
  maxBlockSize         = 20000,      # genes per computational block

  # --- Module merging ---
  mergeCutHeight       = 0.25,       # modules with correlation > 0.75 are merged

  # --- Reassignment ---
  pamRespectsDendro    = FALSE,
  reassignThreshold    = 0,

  # --- Save TOM for later reuse ---
  saveTOMs             = TRUE,
  saveTOMFileBase      = "TOM",

  numericLabels        = TRUE,
  verbose              = 3
)

cor <- temp_cor  # Restore default cor function

table(netwk$colors)  # Number of genes per module
Byte waiting
This will take a while! For ~5,000 genes, expect 5–15 minutes. For 15,000+ genes, it can take an hour or more. The TOM calculation is the bottleneck. If maxBlockSize is smaller than your gene count, WGCNA splits genes into blocks and merges modules afterwards — this is faster but slightly less accurate.
ParameterWhat It ControlsHow to Tune
powerSoft-thresholding exponentUse value from Step 2 diagnostic plot
networkTypeSigned, unsigned, or signed hybrid"signed" for most biological applications
deepSplitSensitivity of module splitting (0–4)2 = default. Increase to 3 or 4 for finer-grained modules. Decrease to 1 for fewer, larger modules.
minModuleSizeSmallest allowed module30 is standard. Increase to 50–100 for cleaner modules; decrease to 15–20 if you expect small regulatory modules.
mergeCutHeightDissimilarity threshold for merging similar modules0.25 = merge modules with eigengene correlation > 0.75. Lower value = more merging.
maxBlockSizeGenes per computation blockSet to total gene count if RAM allows (16GB+ for ~15k genes). Otherwise, use 5000–10000.
Module Dendrogram
R
module_colors <- labels2colors(netwk$colors)

plotDendroAndColors(
  netwk$dendrograms[[1]],
  module_colors[netwk$blockGenes[[1]]],
  "Module Colors",
  dendroLabels = FALSE,
  hang = 0.03,
  addGuide = TRUE,
  guideHang = 0.05,
  main = "Gene Dendrogram with Module Colors"
)

4 Explore Module Composition

R
# Create a gene → module mapping
module_df <- data.frame(
  gene_id = names(netwk$colors),
  module  = labels2colors(netwk$colors),
  stringsAsFactors = FALSE
)

# Count genes per module
module_df %>% count(module, sort = TRUE)

# Save for later use
write.table(module_df, "gene_module_assignments.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)

5 Module–Trait Relationships

The key biological question: which modules associate with which experimental conditions or phenotypes? WGCNA computes a module eigengene for each module and correlates it against your trait data.

R
# Calculate module eigengenes
MEs0 <- moduleEigengenes(input_mat, module_colors)$eigengenes
MEs  <- orderMEs(MEs0)

# Create a numeric trait matrix (encode conditions as 0/1)
# Adapt this to your actual experimental groups
traits <- sample_meta
traits$group <- as.numeric(factor(traits$group)) - 1

# Correlate eigengenes with traits
module_trait_cor  <- cor(MEs, traits, use = "p")
module_trait_pval <- corPvalueStudent(module_trait_cor, nrow(input_mat))

# Heatmap of module-trait relationships
textMatrix <- paste0(signif(module_trait_cor, 2),
                      "\n(", signif(module_trait_pval, 1), ")")
dim(textMatrix) <- dim(module_trait_cor)

labeledHeatmap(
  Matrix       = module_trait_cor,
  xLabels      = colnames(traits),
  yLabels      = names(MEs),
  ySymbols     = names(MEs),
  colorLabels  = FALSE,
  colors       = blueWhiteRed(50),
  textMatrix   = textMatrix,
  setStdMargins = FALSE,
  cex.text     = 0.6,
  zlim         = c(-1, 1),
  main         = "Module-Trait Relationships"
)
Byte explains
Reading the Heatmap: Each cell shows the correlation and its p-value. Strong red = module expression is high when trait value is high. Strong blue = inverse relationship. Focus on modules with |correlation| > 0.5 and p-value < 0.05 for follow-up.
Expression Profiles of Key Modules
R
modules_of_interest <- c("turquoise", "blue", "green")

sub_genes <- module_df %>% filter(module %in% modules_of_interest)
sub_expr  <- expr_mat[sub_genes$gene_id, ]

# Tidy for ggplot
plot_df <- as.data.frame(sub_expr) %>%
  mutate(gene_id = rownames(.)) %>%
  pivot_longer(-gene_id, names_to = "sample", values_to = "expr") %>%
  left_join(module_df, by = "gene_id")

ggplot(plot_df, aes(x = sample, y = expr, group = gene_id)) +
  geom_line(aes(color = module), alpha = 0.15, linewidth = 0.3) +
  facet_grid(rows = vars(module)) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7)) +
  labs(title = "Expression Profiles by Module",
       x = "Sample", y = "Normalized Expression") +
  scale_color_identity()

6 Hub Gene Identification

Hub genes are the most highly connected genes within a module. They often play central regulatory roles and are prime candidates for experimental validation.

R
# Module membership (kME) = correlation of each gene with its module eigengene
kME <- signedKME(input_mat, MEs)

# For a specific module (e.g., turquoise):
target_module <- "turquoise"
target_genes  <- module_df %>% filter(module == target_module) %>% pull(gene_id)
kme_col       <- paste0("kME", target_module)

hub_df <- data.frame(
  gene_id = target_genes,
  kME     = kME[target_genes, kme_col]
) %>%
  arrange(desc(kME))

# Top 10 hub genes
head(hub_df, 10)
What Makes a Good Hub Gene?
  • High module membership (kME > 0.8) — strongly correlated with the module eigengene
  • High gene significance — also correlated with your trait of interest
  • Known biological function — bonus if the gene has a plausible role in the phenotype
  • The intersection of high kME and high gene significance gives you the strongest candidates

7 Export Network for Cytoscape

For interactive network visualization, export the module network to Cytoscape or igraph.

R
# Recalculate TOM for modules of interest
target_genes <- module_df %>%
  filter(module %in% modules_of_interest) %>%
  pull(gene_id)

target_expr <- expr_mat[target_genes, ]

TOM <- TOMsimilarityFromExpr(t(target_expr), power = picked_power)
rownames(TOM) <- rownames(target_expr)
colnames(TOM) <- rownames(target_expr)

# Build edge list
edge_list <- as.data.frame(TOM) %>%
  mutate(gene1 = rownames(.)) %>%
  pivot_longer(-gene1, names_to = "gene2", values_to = "weight") %>%
  filter(gene1 != gene2, weight > 0.02) %>%  # filter weak edges
  left_join(module_df, by = c("gene1" = "gene_id")) %>%
  rename(module1 = module) %>%
  left_join(module_df, by = c("gene2" = "gene_id")) %>%
  rename(module2 = module)

# Save for Cytoscape import
write.table(edge_list, "network_edge_list.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)
cat("Edges exported:", nrow(edge_list), "\n")
Byte tip
Cytoscape Import: In Cytoscape, use File → Import → Network from File. Set gene1 as Source, gene2 as Target, and weight as Edge Attribute. Then apply a force-directed layout. Color nodes by module and size by kME for a publication-ready figure.

Parameter Deep-Dive: Tuning Your Network

WGCNA has many parameters that affect module detection. Here's a systematic approach to tuning:

If You Want...Adjust ThisDirection
More modules (finer resolution)deepSplitIncrease to 3 or 4
Fewer modules (broader grouping)mergeCutHeightIncrease (e.g., 0.4 merges more aggressively)
Remove tiny modulesminModuleSizeIncrease to 50 or 100
Tighter, more connected modulespowerIncrease slightly (more stringent adjacency)
Faster computationmaxBlockSizeDecrease (genes split into smaller blocks, but slight accuracy trade-off)
Better biological interpretabilitynetworkTypeUse "signed"

Summary

Byte
What You Learned:
  1. Prepare and normalize expression data for WGCNA input
  2. Choose the soft-thresholding power using the scale-free fit diagnostic
  3. Build the co-expression network and detect modules with blockwiseModules
  4. Relate modules to experimental traits via eigengene correlation
  5. Identify hub genes using module membership (kME)
  6. Export networks for Cytoscape visualization
  7. Tune every major parameter for your specific dataset