Module 6 — Comparative Genomics & Synteny

Whole Genome Alignment · Synteny & Collinearity · WGD Detection · Gene Family Evolution · Chromosomal Rearrangements · Regulatory Conservation

~160 min Advanced Bash / R / Python
Byte

6.1 — Whole Genome Alignment

ToolTypeBest ForSpeed
minimap2Pairwise, seed-chain-alignQuick pairwise, closely related genomesVery fast
MUMmer4 (nucmer)Pairwise, MUM-basedDetailed structural comparisons, dot plotsFast
CactusMultiple, reference-freeGold standard for multi-species alignmentSlow
Progressive MauveMultiple, LCB-basedBacterial genomes, rearrangementsModerate
SibeliaZMultiple, de Bruijn graphMultiple large eukaryotic genomesModerate
Bash — MUMmer4 & minimap2
# MUMmer4: pairwise whole-genome alignment
nucmer --maxmatch -l 100 -c 500 -t 16 \
  reference.fasta query.fasta -p alignment

# Filter alignments
delta-filter -i 90 -l 1000 alignment.delta > filtered.delta

# Generate dot plot
mummerplot --fat --filter --png -p dotplot filtered.delta
# Diagonal = collinear regions
# Off-diagonal = rearrangements (inversions, translocations)

# Generate 1-to-1 alignment coords
show-coords -r -c -l filtered.delta > alignment.coords

# minimap2: fast pairwise alignment
minimap2 -ax asm5 reference.fasta query.fasta \
  | samtools sort -o aligned.bam
samtools index aligned.bam

# Cactus: reference-free multiple genome alignment
# Requires: seqfile (paths to genomes + species tree)
# echo "((human:0.1,chimp:0.1):0.2,mouse:0.3);" > seqfile.txt
# echo "*human /path/human.fa" >> seqfile.txt
# echo "*chimp /path/chimp.fa" >> seqfile.txt
# echo "*mouse /path/mouse.fa" >> seqfile.txt
cactus jobStore seqfile.txt output.hal --maxCores 32
# Output: HAL format — use halStats, hal2maf, halSynteny to analyze
Download:
Thumb Rule — Tool Choice: For quick pairwise comparison: minimap2. For detailed structural analysis with dot plots: MUMmer. For multi-species alignment (≥3 genomes): Cactus (best accuracy) or SibeliaZ (faster). Progressive Mauve for bacteria with many rearrangements.

6.2 — Synteny Analysis & Visualization

Bash — GENESPACE (Recommended Modern Tool)
# GENESPACE: integrates OrthoFinder + MCScanX in one framework
# Requires: peptide FASTAs + BED files (gene positions) per genome

# Directory structure:
# genespace/
#   peptide/
#     speciesA.fa
#     speciesB.fa
#   bed/
#     speciesA.bed  (format: chr start end geneID)
#     speciesB.bed

# R script:
library(GENESPACE)

gpar <- init_genespace(
  genomeIDs = c("speciesA", "speciesB", "speciesC"),
  speciesIDs = c("speciesA", "speciesB", "speciesC"),
  versionIDs = c("v1", "v1", "v1"),
  ploidy = c(2, 2, 4),  # specify ploidy for WGD detection
  wd = "genespace/",
  nCores = 16,
  path2mcscanx = "/path/to/MCScanX/"
)

# Run the full pipeline
out <- run_genespace(gpar)

# Outputs:
# - Synteny dotplots (macro and micro)
# - Riparian/ribbon plots showing synteny blocks
# - Ortholog maps
# - Ks values per synteny block
Download:

MCScanX & Circos Visualization

Bash
# MCScanX: detect collinear blocks between/within genomes
# Input: BLAST results + GFF-like gene positions

# 1. All-vs-all BLASTP
makeblastdb -in all_proteins.fa -dbtype prot
blastp -query all_proteins.fa -db all_proteins.fa \
  -outfmt 6 -evalue 1e-10 -num_threads 16 \
  -max_target_seqs 5 > all.blast

# 2. Prepare gene position file (.gff format for MCScanX)
# Format: chr_id  gene_id  start  end
# sp1_chr1  sp1_gene001  1000  3000
# sp1_chr1  sp1_gene002  5000  7000

# 3. Run MCScanX
MCScanX all_prefix
# Output: all_prefix.collinearity (synteny blocks)

# 4. Circos visualization
# MCScanX includes: java circle_plotter.java -g gff -s collinearity -c config
# Or use Circos directly:
# circos -conf circos.conf

# 5. Microsynteny visualization with JCVI
python -m jcvi.compara.catalog ortholog speciesA speciesB --no_strip_names
python -m jcvi.compara.synteny depth --histogram speciesA.speciesB.anchors
python -m jcvi.graphics.dotplot speciesA.speciesB.anchors
python -m jcvi.graphics.karyotype seqids layout
Download:
Mnemonic

"Microsynteny = Minutes, Macrosynteny = Millions"

Microsynteny (conserved gene order in small blocks, 3–20 genes) is maintained over moderate evolutionary time. Macrosynteny (chromosome-level conservation) persists for hundreds of millions of years. MCScanX detects both; GENESPACE visualizes them beautifully.

6.3 — Whole Genome Duplication (WGD) Detection

Bash / Python — Ks Analysis
# wgd package: comprehensive WGD analysis
pip install wgd

# Step 1: Build paranome (all-vs-all within genome)
wgd dmd species.pep.fa

# Step 2: Compute Ks values for all paralog pairs
wgd ksd species.pep.fa.mcl species.cds.fa

# Step 3: Plot Ks distribution
wgd viz -d ks_output.tsv -o ks_plots/
# Peaks in the Ks distribution = WGD events
# Recent WGD: peak at low Ks (0.1-0.5)
# Ancient WGD: peak at high Ks (1.0-3.0)
# Background: exponential decay (small-scale duplications)

# Step 4: Mixture model fitting
# wgd mix: fits Gaussian mixture models to identify WGD peaks
wgd mix ks_output.tsv -n 1 2 3  # test 1, 2, 3 components

# ksrates: alternative, accounts for rate variation between species
# https://github.com/VIB-PSB/ksrates
ksrates init --config config.txt
ksrates orthologs
ksrates paralogs
ksrates plot

# Synteny-based WGD confirmation:
# A genuine WGD produces 2:1 synteny depth
# (each chromosome maps to 2 chromosomes in the duplicated genome)
python -m jcvi.compara.synteny depth --histogram anchors_file
Download:
WGD Evidence Hierarchy
  1. Ks peak — necessary but not sufficient (could be burst of tandem duplications)
  2. Synteny depth 2:1 — strong support when combined with Ks peak
  3. Fractionation bias — one subgenome retains more genes than the other (subgenome dominance)
  4. Phylogenomic dating — duplicated gene pairs coalesce at the same time

Claim a WGD only with ≥2 of these lines of evidence converging.

Thumb Rule — Ks Saturation: Ks > 3–5 is unreliable (mutations have saturated, multiple hits). Ancient WGDs (>300 Mya) often have Ks > 3 and are detectable only via synteny, not Ks peaks. For plant polyploidy, most events have Ks 0.3–2.0.

6.4 — Gene Family Evolution

Bash — CAFE5
# CAFE5: Computational Analysis of gene Family Evolution
# Detects gene families with significant expansion or contraction

# Input: gene family count table + dated species tree
# gene_counts.tsv:
# Desc  Family_ID  speciesA  speciesB  speciesC
# (null) fam1      3         5         2
# (null) fam2      1         1         4

# Run CAFE5
cafe5 -i gene_counts.tsv -t dated_species_tree.nwk \
  -p -k 2 -o cafe_output/

# -p: estimate separate birth/death rates per clade
# -k 2: two rate categories (fast families + slow families)

# Output analysis:
# cafe_output/Base_clade_results.txt — per-clade expansions/contractions
# cafe_output/Base_family_results.txt — per-family significance
# Significant families: p-value < 0.05 (corrected for multiple testing)

# Visualize with cafeplotter (R)
# https://github.com/xiangpin/cafeplotter
# Or: plot tree with +/- gene counts per branch
Download:

Gene Tree Reconciliation

Bash
# Notung: reconcile gene tree with species tree
# Detects: duplications, losses, transfers
java -jar Notung.jar -g gene_tree.nwk -s species_tree.nwk \
  --rearrange --threshold 90 --parsimony

# RANGER-DTL: reconciliation with duplication, transfer, loss
ranger-dtl -i input.txt -o output.txt
# input.txt: species tree + gene tree in Newick

# ALE: Amalgamated Likelihood Estimation
# Probabilistic reconciliation — handles uncertainty in gene trees
ALEobserve gene_tree.ufboot   # creates .ale file from bootstraps
ALEml_undated species_tree.nwk gene_tree.ufboot.ale
# Output: expected number of D, T, L events per branch
Mnemonic

"Duplication, Transfer, Loss — DTL reconciliation"

Gene trees disagree with species trees because of DTL events. Reconciliation algorithms map gene tree nodes to species tree branches and infer the minimum (parsimony) or most probable (likelihood) set of D, T, and L events. ALE handles gene tree uncertainty via bootstrap distributions.

6.5 — Chromosomal Rearrangements & Karyotype Evolution

Bash
# GRIMM: compute minimum rearrangement distance
# (inversions, translocations, fusions, fissions)
# Input: synteny blocks as signed permutations
# Genome A: 1 2 3 4 5
# Genome B: 1 -3 -2 4 5  (inversion of block 2-3)
grimm -f genomes.txt -d
# Output: minimum number of rearrangements + scenario

# DCJ (Double Cut and Join): more general rearrangement model
# Handles inversions, translocations, fusions, fissions equally
# UniMoG tool: http://bibiserv.cebitec.uni-bielefeld.de/dcj

# Ancestral karyotype reconstruction:
# 1. Identify synteny blocks across multiple species
# 2. Represent each genome as ordered block permutation
# 3. Use MGRA (Multiple Genome Rearrangement and Ancestors) or ANGES
#    to reconstruct ancestral chromosomal arrangement
# 4. Map rearrangements on phylogeny branch by branch

6.6 — Comparative Regulatory Genomics

Bash — phastCons & phyloP
# phastCons: identify conserved elements from multiple alignment
# Input: MAF (multiple alignment format) + species tree + neutral model

# Step 1: Estimate neutral model from 4-fold degenerate sites
phyloFit --tree species_tree.nwk --subst-mod REV \
  --msa-format MAF fourfold_sites.maf

# Step 2: Run phastCons
phastCons --target-coverage 0.3 --expected-length 45 \
  --most-conserved conserved_elements.bed \
  alignment.maf neutral_model.mod > scores.wig

# Step 3: phyloP — per-base conservation/acceleration scores
phyloP --method LRT --mode CONACC \
  neutral_model.mod alignment.maf > phylop_scores.wig

# Interpretation:
# phyloP > 0: conservation (purifying selection)
# phyloP < 0: acceleration (faster than neutral — possible positive selection)
# phyloP near 0: evolving neutrally

# Conserved Noncoding Elements (CNEs):
# Filter phastCons elements that do NOT overlap annotated exons
bedtools intersect -v -a conserved_elements.bed \
  -b exons.bed > CNEs.bed

# Human Accelerated Regions (HARs):
# Regions conserved in mammals but accelerated in human
# Pre-computed: download from UCSC or Zoonomia project
Download:
Byte
Comparative Regulatory Genomics Key Concepts:
  • Conserved Noncoding Elements (CNEs): likely regulatory (enhancers, silencers), often more conserved than coding regions in deeply divergent comparisons
  • Accelerated Regions (HARs, BARs): conserved in most species but rapidly evolving in one lineage — candidates for lineage-specific phenotypic innovation
  • Phylogenetic footprinting: aligning non-coding regions across species to find conserved TF binding motifs
  • ENCODE comparative data: functional genomics (ChIP-seq, ATAC-seq, RNA-seq) across species links sequence conservation to regulatory function

AI Prompt Guide

Comparative Genomics Analysis Pipeline
I have [NUMBER] chromosome-level genome assemblies from [SPECIES LIST]. Species tree divergence: ~[X] Mya. Each genome has annotated proteins and GFF3 files. I want to perform a comprehensive comparative genomics analysis: 1. Pairwise whole-genome alignment with MUMmer (dot plots) 2. Multi-species synteny analysis with GENESPACE 3. Ks distribution and WGD detection (wgd package + synteny depth) 4. Gene family expansion/contraction with CAFE5 5. Gene tree reconciliation for key families (ALE) 6. Conservation scoring with phastCons/phyloP from Cactus alignment 7. Identification of conserved noncoding elements (CNEs) 8. Circos plots for synteny visualization Please write complete Bash/R scripts for each step with: - Proper file format conversion between tools - SLURM scripts for HPC - Publication-quality figures - Interpretation guidelines

Troubleshooting

GENESPACE: "No syntenic regions found"

Check: (1) Gene IDs in BED files match protein FASTA headers exactly, (2) Genomes are not too divergent (>500 Mya species may have no detectable synteny), (3) BED file is sorted by chromosome and position, (4) Peptide files contain only the longest isoform per gene.

CAFE5: "Error: unreasonable lambda values"

The birth-death rate (lambda) is too high or the tree has very short branches. Solutions: (1) Filter out very large gene families (>100 copies), (2) Check the species tree is ultrametric (branch lengths in time), (3) Ensure gene counts are correct (no paralogs counted multiple times).

Ks plot: No clear peak, just exponential decay

This means no detectable WGD. The exponential decay is from small-scale (tandem) duplications + background gene birth-death. This is a valid biological result — not all lineages have WGDs. Don't force-fit Gaussian components to noise.

Cactus: Out of memory or extremely slow

Cactus is memory-hungry. For large genomes (>1 Gb): (1) Use the Minigraph-Cactus pipeline instead of progressive Cactus, (2) Set --maxCores and --maxMemory appropriately, (3) Use softmasked genomes (mask repeats first with RepeatMasker), (4) Run on a high-memory HPC node (512 GB+).

Mnemonics & Thumb Rules

Mnemonic

"Ks Peaks = Polyploidy, Ks Decay = Duplication background"

In a Ks distribution: peaks above the exponential background indicate WGD events. The exponential component is from ongoing small-scale duplications. Multiple peaks = multiple WGDs at different times.

Mnemonic

"2:1 depth = WGD, 1:1 = 1R"

Synteny depth tells you the ploidy history. If each region in genome A maps to 2 regions in genome B, genome B had a WGD after diverging from A. 3:1 = two WGDs (hexaploid) or whole genome triplication.

Thumb Rule — CAFE5 Gene Counts: Remove families with >100 genes in any species (likely transposable elements or annotation errors). Also remove single-copy universal genes (no variation to test). CAFE5 works best with families of 2–50 genes across species.
Thumb Rule — Conservation Scores: phastCons gives binary-like scores (conserved element yes/no). phyloP gives continuous scores per base (magnitude of conservation or acceleration). Use phastCons for element discovery, phyloP for per-site annotation.