Bioinformatics Data Visualization Masterclass

From boxplots to interactive volcano plots — build every figure a bioinformatics paper needs, with the grammar of graphics and modern R packages.

~40 min Beginner R / ggplot2

The Grammar of Graphics

ggplot2 is built on Leland Wilkinson's "Grammar of Graphics" — the idea that every statistical graphic is composed of layers. Understanding these layers gives you total control over your figures:

  1. Data — the tidy data frame you're plotting
  2. Aesthetics (aes) — mappings from data columns to visual properties (x, y, color, size, shape)
  3. Geometries (geom_*) — the visual marks (points, lines, bars, boxplots)
  4. Facets — split one plot into small multiples by a categorical variable
  5. Scales — control how data values map to aesthetic values (axis limits, color palettes)
  6. Themes — non-data visual elements (fonts, grid lines, backgrounds)
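
As a minimal sketch of how the six components stack, the toy example below builds one plot with each component on its own line. The data frame `toy` and its columns are invented for illustration only.

```r
library(ggplot2)

# Hypothetical toy data: two groups measured in two tissues
toy <- data.frame(
  group  = rep(c("Control", "Treated"), each = 10),
  tissue = rep(c("Liver", "Brain"), times = 10),
  dose   = rep(1:10, times = 2),
  signal = c(1:10, (1:10) * 1.5)
)

p <- ggplot(toy, aes(x = dose, y = signal, color = group)) +  # 1-2: data + aesthetics
  geom_point(size = 2) +                                      # 3: geometry
  facet_wrap(~ tissue) +                                      # 4: facets
  scale_color_brewer(palette = "Set1") +                      # 5: scales
  theme_minimal(base_size = 12)                               # 6: theme

p
```

Removing or swapping any single line changes only that component, which is the practical benefit of thinking in layers.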
Byte key point
The Golden Rule of ggplot2: Your data must be in tidy (long) format — one observation per row, one variable per column. Use tidyr::pivot_longer() to reshape wide data before plotting. If your data isn't tidy, your ggplot code will fight you every step of the way.
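
To make the reshaping step concrete, here is a small sketch that converts a wide expression table to tidy format. The table `wide`, its gene names, and the `condition_replicate` column naming scheme are assumptions made up for this example.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide table: one row per gene, one column per sample
wide <- data.frame(
  gene      = c("GeneA", "GeneB", "GeneC"),
  control_1 = c(10, 200, 35),
  control_2 = c(12, 180, 40),
  treated_1 = c(30, 190, 5),
  treated_2 = c(28, 210, 8)
)

# Reshape to long: one row per gene-sample observation
long <- pivot_longer(
  wide,
  cols      = -gene,
  names_to  = c("condition", "replicate"),
  names_sep = "_",
  values_to = "expression"
)

# Each column is now a variable that can be mapped inside aes()
ggplot(long, aes(x = condition, y = expression, color = gene)) +
  geom_point(size = 3)
```

In the wide table, "condition" is hidden in the column names and cannot be mapped to an aesthetic; after pivoting it is an ordinary column.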

0 Setup & Packages

R
install.packages(c("tidyverse", "pheatmap", "RColorBrewer",
                    "plotly", "ggrepel", "patchwork"))

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
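# DESeq2 supplies plotPCA(), assay(), and colData(), used in the PCA and heatmap sections
BiocManager::install(c("DESeq2", "EnhancedVolcano", "ComplexHeatmap"))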

library(tidyverse)      # includes ggplot2, dplyr, tidyr, readr
library(pheatmap)
library(RColorBrewer)
library(plotly)
library(ggrepel)
library(patchwork)      # combine multiple ggplots
library(EnhancedVolcano)

1 Boxplots & Violin Plots

Boxplots summarize distribution with median, quartiles, and outliers. Violin plots add a kernel density estimate, showing the full distribution shape. Combining both gives the most informative view.
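
The numbers behind a boxplot can be reproduced by hand. The sketch below uses a made-up vector `x` and the common 1.5 × IQR rule for flagging outliers, which is also the default used by geom_boxplot().

```r
# Hypothetical expression values for one group, with one extreme value
x <- c(420, 455, 470, 480, 500, 510, 530, 545, 560, 900)

q      <- quantile(x, probs = c(0.25, 0.5, 0.75))  # box edges and median
iqr    <- q[["75%"]] - q[["25%"]]
fences <- c(lower = q[["25%"]] - 1.5 * iqr,
            upper = q[["75%"]] + 1.5 * iqr)

# Points beyond the fences are the ones drawn individually as outliers
outliers <- x[x < fences["lower"] | x > fences["upper"]]
outliers
```

Here only the value 900 falls outside the fences; the whiskers would stop at the most extreme values still inside them.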

Basic Boxplot
R
# Simulate some gene expression data across organisms and gene families
set.seed(42)
expr_data <- expand.grid(
  organism = c("Human", "Mouse", "Zebrafish"),
  family   = paste0("GeneFamily_", LETTERS[1:4]),
  rep      = 1:20
) %>%
  mutate(
    expression = rnorm(n(), mean = 500, sd = 150) +
      as.numeric(factor(organism)) * 50 +
      as.numeric(factor(family)) * 30
  )

# Simple boxplot
ggplot(expr_data, aes(x = organism, y = expression)) +
  geom_boxplot() +
  theme_minimal(base_size = 14) +
  labs(title = "Gene Expression by Organism",
       x = "Organism", y = "Expression (counts)")
Colored Boxplot with Facets
R
ggplot(expr_data, aes(x = organism, y = expression, fill = organism)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 2, alpha = 0.8) +
  facet_wrap(~ family, scales = "free_y") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  labs(title = "Expression Distribution by Gene Family",
       x = NULL, y = "Expression")
Violin + Boxplot + Jitter Combo
R
# The most informative single-plot approach
ggplot(expr_data, aes(x = organism, y = expression, fill = organism)) +
  geom_violin(alpha = 0.4, width = 0.9) +          # distribution shape
  geom_boxplot(width = 0.15, outlier.shape = NA) +  # summary stats
  geom_jitter(width = 0.05, alpha = 0.3, size = 1) + # individual points
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none") +
  labs(title = "Violin + Boxplot + Jitter",
       subtitle = "Shows distribution, summary, and individual observations",
       x = NULL, y = "Expression")
Byte tip
When to Use Which:
  • Boxplot alone — when you have many groups and need a compact display
  • Violin — when distribution shape matters (bimodal? skewed?)
  • Jitter overlay — when n < 50 per group, so individual points are meaningful
  • Beeswarm (ggbeeswarm) — like jitter but non-overlapping; elegant for small n

2 PCA Plots

PCA (Principal Component Analysis) reduces high-dimensional expression data to 2D for visualization. It reveals global patterns: do samples cluster by condition? Is there a batch effect?

R
# Assuming you have a DESeq2 vsd object from the RNA-Seq tutorial
# vsd <- vst(dds, blind = FALSE)
library(DESeq2)   # makes plotPCA(), assay(), and colData() available

# Method 1: DESeq2 built-in (quick)
plotPCA(vsd, intgroup = "condition") +
  theme_minimal(base_size = 14)

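# Method 2: Custom PCA with more control
# prcomp(scale. = TRUE) stops with an error on zero-variance genes, so drop them first
vst_mat  <- assay(vsd)
vst_mat  <- vst_mat[apply(vst_mat, 1, var) > 0, ]
pca_data <- prcomp(t(vst_mat), center = TRUE, scale. = TRUE)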
pct_var   <- round(100 * summary(pca_data)$importance[2, 1:2], 1)

pca_df <- data.frame(
  PC1       = pca_data$x[, 1],
  PC2       = pca_data$x[, 2],
  condition = colData(vsd)$condition,
  sample    = colnames(vsd)
)

ggplot(pca_df, aes(x = PC1, y = PC2, color = condition)) +
  geom_point(size = 4, alpha = 0.9) +
  geom_text_repel(aes(label = sample), size = 3, max.overlaps = 10) +
  stat_ellipse(type = "norm", linetype = 2, level = 0.95) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 14) +
  labs(
    title = "PCA of RNA-Seq Samples",
    x = paste0("PC1 (", pct_var[1], "% variance)"),
    y = paste0("PC2 (", pct_var[2], "% variance)")
  )
Byte key point
Interpreting PCA:
  • Samples in the same condition should cluster together — if they don't, check for batch effects or outliers
  • PC1 captures the most variance — if it separates by batch instead of condition, that's a red flag
  • Always report the % variance explained by each PC on the axis labels
  • Confidence ellipses (95%) help visually assess separation, but are not a formal statistical test
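
The points above can be checked on simulated data. This sketch assumes a made-up matrix `mat` (200 genes, 8 samples) in which the first 50 genes are shifted upward in the treated group; it is not output from a real experiment.

```r
set.seed(1)

# Hypothetical data: 200 genes x 8 samples, 4 control and 4 treated
n_genes   <- 200
condition <- rep(c("Control", "Treated"), each = 4)
mat <- matrix(rnorm(n_genes * 8), nrow = n_genes,
              dimnames = list(paste0("gene", 1:n_genes), paste0("S", 1:8)))
mat[1:50, condition == "Treated"] <- mat[1:50, condition == "Treated"] + 3

pca <- prcomp(t(mat), center = TRUE, scale. = FALSE)

# Percent variance per PC = squared standard deviation over the total
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(pct_var[1:3], 1)

# With a strong condition effect, PC1 scores should split by condition
split(round(pca$x[, "PC1"], 1), condition)
```

If the same split appeared when grouping by a batch variable instead of condition, that would be the red flag described above.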

3 Volcano Plots

The volcano plot is the signature visualization for differential expression results. X-axis = effect size (log2FC), Y-axis = significance (-log10 padj).
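
To try the plotting code in this section without a finished DESeq2 run, a simulated stand-in for `res_df` can be used. Everything below is an assumption for practice purposes: the gene names, the distributions, and the toy p-value model are invented and do not represent real results.

```r
set.seed(42)
n <- 2000

# Hypothetical stand-in for DESeq2 output; values are simulated, not real results
res_df <- data.frame(
  gene           = paste0("Gene", seq_len(n)),
  baseMean       = round(rlnorm(n, meanlog = 5, sdlog = 1.5), 1),
  log2FoldChange = c(rnorm(n - 200, 0, 0.5), rnorm(200, 0, 2.5))
)

# Toy p-values: larger |log2FC| gives a smaller p, followed by BH adjustment
z             <- abs(res_df$log2FoldChange) / 0.6
res_df$pvalue <- 2 * pnorm(-z)
res_df$padj   <- p.adjust(res_df$pvalue, method = "BH")

head(res_df, 3)
```

The column names match those the tutorial code expects (gene, baseMean, log2FoldChange, padj), so the volcano and MA examples should run on this table unchanged.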

Custom ggplot2 Volcano Plot
R
# Assuming res_df from DESeq2 tutorial (gene, log2FoldChange, padj columns)
res_df <- res_df %>%
  mutate(
    significance = case_when(
      padj < 0.05 & log2FoldChange >  1 ~ "Up",
      padj < 0.05 & log2FoldChange < -1 ~ "Down",
      TRUE                               ~ "NS"
    )
  )

# Count genes in each category for the subtitle
up_n   <- sum(res_df$significance == "Up", na.rm = TRUE)
down_n <- sum(res_df$significance == "Down", na.rm = TRUE)

ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = significance)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(values = c("Up" = "#e74c3c", "Down" = "#3498db", "NS" = "grey70")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "grey40") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey40") +
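  geom_text_repel(
    # Sort by padj first so head(15) picks the most significant genes, not the first 15 rows
    data = res_df %>% filter(padj < 1e-10, abs(log2FoldChange) > 2) %>%
      arrange(padj) %>% head(15),
    aes(label = gene), size = 3, max.overlaps = 15, color = "black"
  ) +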
  theme_minimal(base_size = 14) +
  labs(
    title    = "Volcano Plot: Treated vs Control",
    subtitle = paste0(up_n, " upregulated, ", down_n, " downregulated (padj<0.05, |LFC|>1)"),
    x = expression(log[2]~"Fold Change"),
    y = expression(-log[10]~"adjusted p-value"),
    color = "Status"
  )
EnhancedVolcano (One-Liner)
R
library(EnhancedVolcano)

EnhancedVolcano(res_df,
  lab       = res_df$gene,
  x         = 'log2FoldChange',
  y         = 'padj',
  pCutoff   = 0.05,
  FCcutoff  = 1.0,
  title     = 'Treated vs Control',
  subtitle  = 'Shrunken log2FC, BH-adjusted p-values',
  pointSize = 2.5,
  labSize   = 3.5,
  drawConnectors = TRUE)

4 MA Plots

MA plots show log2FC (M) vs. mean expression (A). They reveal whether fold changes are biased by expression level.
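
In the classic definition, M is the log ratio and A is the average log intensity of the two conditions. The sketch below works this out for four made-up genes; the `control` and `treated` vectors are hypothetical normalized means. The DESeq2-style plot in this section uses log10(baseMean + 1) on the x-axis, which plays the same role as A.

```r
# Hypothetical normalized mean counts for four genes
control <- c(100, 100, 50, 800)
treated <- c(100, 400, 25, 800)

M <- log2(treated) - log2(control)          # log ratio (the y-axis)
A <- (log2(treated) + log2(control)) / 2    # mean log intensity (the x-axis)

data.frame(control, treated, M, A = round(A, 2))
```

A fourfold increase gives M = 2 and a halving gives M = -1, regardless of how highly the gene is expressed.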

R
ggplot(res_df, aes(x = log10(baseMean + 1), y = log2FoldChange, color = significance)) +
  geom_point(alpha = 0.4, size = 1) +
  scale_color_manual(values = c("Up" = "#e74c3c", "Down" = "#3498db", "NS" = "grey75")) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  theme_minimal(base_size = 14) +
  labs(
    title = "MA Plot",
    x = expression(log[10]~"Mean Expression"),
    y = expression(log[2]~"Fold Change"),
    color = NULL
  )
Byte explains
Reading an MA Plot: A healthy MA plot should show a symmetric cloud of grey points centered on y=0, with colored significant genes spread across all expression levels. If the cloud tilts up or down, there may be a normalization issue. If significant genes cluster only at low expression (left side), increase your pre-filtering cutoff.
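
One common way to raise the pre-filtering cutoff is to keep only genes with at least a minimum count in a minimum number of samples. The sketch below applies that rule to a simulated count matrix; the thresholds `min_count` and `min_samples` are assumed values to tune for your own data, and `counts` is invented. With a DESeqDataSet, the same logical index could be applied to counts(dds).

```r
set.seed(7)

# Hypothetical raw counts: 1000 genes x 6 samples, half the genes lowly expressed
mu_gene <- rep(c(2, 50), each = 500)
counts  <- matrix(rnbinom(6000, mu = rep(mu_gene, times = 6), size = 1),
                  nrow = 1000)

min_count   <- 10   # assumed threshold; tune to your library sizes
min_samples <- 3    # assumed: size of the smallest group

# Keep genes that reach min_count in at least min_samples samples
keep     <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

c(before = nrow(counts), after = nrow(filtered))
```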

5 Heatmaps

Heatmaps are essential for showing patterns across many genes and samples simultaneously. We cover two tools: pheatmap and ComplexHeatmap.

pheatmap (Best All-Rounder)
R
library(pheatmap)

# Top 50 DE genes, variance-stabilized expression, row-centered
top_genes <- res_df %>% filter(padj < 0.05) %>% arrange(padj) %>% head(50) %>% pull(gene)
mat <- assay(vsd)[top_genes, ]
mat <- mat - rowMeans(mat)  # Center each gene to mean 0

# Annotation bar for columns
anno_col <- data.frame(
  row.names = colnames(mat),
  Condition = colData(vsd)$condition
)

pheatmap(
  mat,
  scale              = "none",        # already centered
  cluster_rows       = TRUE,
  cluster_cols       = TRUE,
  clustering_method  = "ward.D2",     # compact clusters
  clustering_distance_rows = "correlation",
  show_rownames      = TRUE,
  show_colnames      = TRUE,
  annotation_col     = anno_col,
  color              = colorRampPalette(c("#2166AC", "white", "#B2182B"))(100),
  border_color       = NA,
  fontsize_row       = 7,
  main               = "Top 50 DE Genes (Row-Centered)"
)
Parameter | Options & When to Use
scale | "none" if pre-centered. "row" to z-score normalize per gene. "column" rarely used.
clustering_method | "ward.D2" (tight clusters), "complete" (default), "average" (UPGMA, used in phylogenetics)
clustering_distance_rows | "correlation" for expression (groups co-expressed genes). "euclidean" for magnitude-based clustering.
color | Diverging palette (blue-white-red) for centered data. Sequential palette for raw values. Never use rainbow.
cutree_rows | Integer: cuts the row dendrogram into N groups and adds gap lines. Great for showing module structure.
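
What these options do can be reproduced with base R. The following sketch uses a made-up 6-gene matrix with two expression patterns at very different magnitudes; it mirrors row z-scoring, correlation distance (1 minus Pearson correlation), Ward clustering, and cutting the tree into two groups. It is an illustration of the ideas, not pheatmap's internal code.

```r
set.seed(3)

# Hypothetical matrix: 6 genes x 5 samples, two co-expression patterns
pattern_up   <- c(1, 2, 3, 4, 5)
pattern_down <- rev(pattern_up)
mat <- rbind(
  t(sapply(c(1, 10, 100), function(s) s * pattern_up   + rnorm(5, sd = 0.1))),
  t(sapply(c(1, 10, 100), function(s) s * pattern_down + rnorm(5, sd = 0.1)))
)
rownames(mat) <- paste0("gene", 1:6)

# scale = "row": z-score each gene (mean 0, sd 1 across samples)
z <- t(scale(t(mat)))

# clustering_distance_rows = "correlation": 1 - Pearson correlation between genes
d_cor <- as.dist(1 - cor(t(mat)))

# clustering_method = "ward.D2", then the equivalent of cutree_rows = 2
hc      <- hclust(d_cor, method = "ward.D2")
modules <- cutree(hc, k = 2)
modules
```

Genes 1 to 3 land in one module and genes 4 to 6 in the other even though their magnitudes differ a hundredfold, which is why correlation distance suits co-expression questions.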
ComplexHeatmap (Maximum Control)
R
library(ComplexHeatmap)
library(circlize)

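# Convert the row-centered matrix to row z-scores so the "Z-score" legend
# and the -2 to 2 color range are accurate
mat_z <- t(scale(t(mat)))
col_fun <- colorRamp2(c(-2, 0, 2), c("#2166AC", "white", "#B2182B"))

# Assumes the condition levels are named "Control" and "Treated"
ha <- HeatmapAnnotation(
  Condition = colData(vsd)$condition,
  col = list(Condition = c("Control" = "#4DAF4A", "Treated" = "#E41A1C"))
)

Heatmap(
  mat_z,
  name                    = "Z-score",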
  col                     = col_fun,
  top_annotation          = ha,
  show_row_names          = TRUE,
  show_column_names       = TRUE,
  row_names_gp            = gpar(fontsize = 6),
  clustering_method_rows  = "ward.D2",
  clustering_distance_rows = "pearson",
  column_title            = "Top 50 Differentially Expressed Genes"
)
Byte tip
ComplexHeatmap vs. pheatmap: Use pheatmap for quick, clean heatmaps with minimal code. Switch to ComplexHeatmap when you need multiple annotation tracks, split heatmaps, or integrated density/bar plots on the margins. ComplexHeatmap is more verbose but far more flexible.
Sample Distance Heatmap
R
# Euclidean distance between samples
dist_mat <- as.matrix(dist(t(assay(vsd))))

pheatmap(
  dist_mat,
  clustering_distance_rows = as.dist(dist_mat),
  clustering_distance_cols = as.dist(dist_mat),
  # Distances are non-negative, so use a sequential palette (dark = similar)
  color = colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Blues")))(100),
  annotation_col = anno_col,
  main = "Sample-to-Sample Distance"
)

6 Interactive Charts with Plotly

Static figures are great for papers, but interactive charts let you hover over data points — perfect for exploring results, presentations, and supplementary material.

R
library(plotly)

# Turn any ggplot into an interactive plot with one line
p <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj),
                         color = significance, text = gene)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_manual(values = c("Up" = "#e74c3c", "Down" = "#3498db", "NS" = "grey70")) +
  theme_minimal() +
  labs(title = "Interactive Volcano Plot")

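# Keep the widget in a variable so the saved file uses the same tooltip settings
p_interactive <- ggplotly(p, tooltip = c("text", "x", "y"))
p_interactive

# Save as self-contained HTML
htmlwidgets::saveWidget(p_interactive, "volcano_interactive.html")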
Byte tip
Interactive PCA: The same ggplotly() trick works for PCA plots — hover to see sample names, zoom into clusters, and share the HTML file with collaborators who don't use R.
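
A minimal sketch of that idea follows. The data frame `pca_demo` is simulated so the example runs on its own; in practice you would reuse the PCA scores computed in the PCA section.

```r
library(ggplot2)
library(plotly)

set.seed(11)
# Hypothetical PCA scores for 8 samples
pca_demo <- data.frame(
  PC1       = c(rnorm(4, -5), rnorm(4, 5)),
  PC2       = rnorm(8),
  condition = rep(c("Control", "Treated"), each = 4),
  sample    = paste0("S", 1:8)
)

# The text aesthetic is ignored by ggplot2 itself (with a warning)
# but is picked up by ggplotly() for the hover tooltip
p_pca <- ggplot(pca_demo, aes(x = PC1, y = PC2, color = condition, text = sample)) +
  geom_point(size = 4)

pca_widget <- ggplotly(p_pca, tooltip = c("text", "x", "y"))
pca_widget
```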

7 Themes, Color Palettes & Export

Publication-Ready Themes
R
# Recommended theme stack for publications
theme_pub <- theme_minimal(base_size = 14) +
  theme(
    plot.title      = element_text(face = "bold", size = 16),
    axis.title      = element_text(face = "bold"),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    strip.text      = element_text(face = "bold")
  )

# Apply to any ggplot
my_plot + theme_pub
Color Palette Recommendations
Use Case | Palette | Code
Categorical (up to 8 groups) | Brewer "Set2" or "Dark2" | scale_color_brewer(palette = "Set2")
Diverging (heatmaps, fold change) | Blue-White-Red | scale_fill_gradient2(low = "blue", mid = "white", high = "red")
Sequential (expression level) | Viridis | scale_fill_viridis_c()
Colorblind-safe | Okabe-Ito or viridis | scale_color_manual(values = c(...))
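
For the colorblind-safe row, one possible way to supply the values is the Okabe-Ito palette that ships with base R (R 4.0.0 or later) through palette.colors(). The data frame `df` below is made up for illustration.

```r
library(ggplot2)

# Okabe-Ito palette from grDevices; unname() so colors are assigned by position
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))

# Hypothetical three-group data
df <- data.frame(
  x     = 1:9,
  y     = c(2, 4, 3, 6, 5, 7, 9, 8, 10),
  group = rep(c("A", "B", "C"), each = 3)
)

# Skip the first entry (black) so the groups get distinct hues
p <- ggplot(df, aes(x, y, color = group)) +
  geom_point(size = 3) +
  scale_color_manual(values = okabe_ito[2:4])
p
```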
Saving Figures
R
# Vector formats for publication (scalable, small file size); dpi does not apply to vector output
ggsave("figure1.pdf", plot = my_plot, width = 8, height = 6)
ggsave("figure1.svg", plot = my_plot, width = 8, height = 6)  # requires the svglite package

# Raster format for supplementary / web
ggsave("figure1.png", plot = my_plot, width = 8, height = 6, dpi = 300)

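# Combine multiple plots with patchwork
# pca_plot, volcano_plot, and heatmap_plot are placeholders for plots you saved to variables.
# pheatmap() does not return a ggplot; wrap its gtable first, for example:
# heatmap_plot <- wrap_elements(full = pheatmap(mat, silent = TRUE)$gtable)
library(patchwork)
combined <- (pca_plot | volcano_plot) / heatmap_plot +
  plot_annotation(tag_levels = "A")   # adds A, B, C labels
ggsave("figure_combined.pdf", combined, width = 14, height = 12)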
Byte warns
Journal Requirements: Most journals require 300 DPI minimum for raster images and prefer vector formats (PDF, SVG, EPS). Always check the target journal's figure guidelines before exporting. Use cairo_pdf() if you need embedded fonts.
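
The DPI requirement translates directly into pixel dimensions: size in inches multiplied by DPI. The sketch below checks that arithmetic and writes a test PNG to a temporary folder; the plot and the file name are placeholders.

```r
library(ggplot2)

# Pixel dimensions of a raster export = size in inches x DPI
width_in  <- 8
height_in <- 6
dpi       <- 300
px <- c(width_px = width_in * dpi, height_px = height_in * dpi)
px   # 2400 x 1800 pixels

# Hypothetical plot, saved to a temporary PNG at that resolution
p <- ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x, y)) + geom_line()
out_file <- file.path(tempdir(), "figure_check.png")
ggsave(out_file, plot = p, width = width_in, height = height_in, dpi = dpi)
file.exists(out_file)
```

If a journal specifies figure width in millimetres, divide by 25.4 to get inches before passing it to ggsave(), or set units = "mm".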

Summary

What You Learned:
  1. The grammar of graphics and how ggplot2 layers work
  2. Boxplots, violin plots, and when to combine them with jitter
  3. Custom PCA plots with variance labels and confidence ellipses
  4. Publication-quality volcano and MA plots
  5. Heatmaps with pheatmap and ComplexHeatmap, including clustering options
  6. Interactive exploration with plotly
  7. Color palettes, themes, and export for publication