Whole Genome Alignment · Synteny & Collinearity · WGD Detection · Gene Family Evolution · Chromosomal Rearrangements · Regulatory Conservation
| Tool | Type | Best For | Speed |
|---|---|---|---|
| minimap2 | Pairwise, seed-chain-align | Quick pairwise, closely related genomes | Very fast |
| MUMmer4 (nucmer) | Pairwise, MUM-based | Detailed structural comparisons, dot plots | Fast |
| Cactus | Multiple, reference-free | Gold standard for multi-species alignment | Slow |
| Progressive Mauve | Multiple, LCB-based | Bacterial genomes, rearrangements | Moderate |
| SibeliaZ | Multiple, de Bruijn graph | Multiple large eukaryotic genomes | Moderate |
# MUMmer4: pairwise whole-genome alignment nucmer --maxmatch -l 100 -c 500 -t 16 \ reference.fasta query.fasta -p alignment # Filter alignments delta-filter -i 90 -l 1000 alignment.delta > filtered.delta # Generate dot plot mummerplot --fat --filter --png -p dotplot filtered.delta # Diagonal = collinear regions # Off-diagonal = rearrangements (inversions, translocations) # Generate 1-to-1 alignment coords show-coords -r -c -l filtered.delta > alignment.coords # minimap2: fast pairwise alignment minimap2 -ax asm5 reference.fasta query.fasta \ | samtools sort -o aligned.bam samtools index aligned.bam # Cactus: reference-free multiple genome alignment # Requires: seqfile (paths to genomes + species tree) # echo "((human:0.1,chimp:0.1):0.2,mouse:0.3);" > seqfile.txt # echo "*human /path/human.fa" >> seqfile.txt # echo "*chimp /path/chimp.fa" >> seqfile.txt # echo "*mouse /path/mouse.fa" >> seqfile.txt cactus jobStore seqfile.txt output.hal --maxCores 32 # Output: HAL format — use halStats, hal2maf, halSynteny to analyze
# GENESPACE: integrates OrthoFinder + MCScanX in one framework
# Requires: peptide FASTAs + BED files (gene positions) per genome
# Directory structure:
# genespace/
# peptide/
# speciesA.fa
# speciesB.fa
# bed/
# speciesA.bed (format: chr start end geneID)
# speciesB.bed
# R script:
library(GENESPACE)
gpar <- init_genespace(
genomeIDs = c("speciesA", "speciesB", "speciesC"),
speciesIDs = c("speciesA", "speciesB", "speciesC"),
versionIDs = c("v1", "v1", "v1"),
ploidy = c(2, 2, 4), # specify ploidy for WGD detection
wd = "genespace/",
nCores = 16,
path2mcscanx = "/path/to/MCScanX/"
)
# Run the full pipeline
out <- run_genespace(gpar)
# Outputs:
# - Synteny dotplots (macro and micro)
# - Riparian/ribbon plots showing synteny blocks
# - Ortholog maps
# - Ks values per synteny block
# MCScanX: detect collinear blocks between/within genomes # Input: BLAST results + GFF-like gene positions # 1. All-vs-all BLASTP makeblastdb -in all_proteins.fa -dbtype prot blastp -query all_proteins.fa -db all_proteins.fa \ -outfmt 6 -evalue 1e-10 -num_threads 16 \ -max_target_seqs 5 > all.blast # 2. Prepare gene position file (.gff format for MCScanX) # Format: chr_id gene_id start end # sp1_chr1 sp1_gene001 1000 3000 # sp1_chr1 sp1_gene002 5000 7000 # 3. Run MCScanX MCScanX all_prefix # Output: all_prefix.collinearity (synteny blocks) # 4. Circos visualization # MCScanX includes: java circle_plotter.java -g gff -s collinearity -c config # Or use Circos directly: # circos -conf circos.conf # 5. Microsynteny visualization with JCVI python -m jcvi.compara.catalog ortholog speciesA speciesB --no_strip_names python -m jcvi.compara.synteny depth --histogram speciesA.speciesB.anchors python -m jcvi.graphics.dotplot speciesA.speciesB.anchors python -m jcvi.graphics.karyotype seqids layout
"Microsynteny = Minutes, Macrosynteny = Millions"
Microsynteny (conserved gene order in small blocks, 3–20 genes) is maintained over moderate evolutionary time. Macrosynteny (chromosome-level conservation) persists for hundreds of millions of years. MCScanX detects both; GENESPACE visualizes them beautifully.
# wgd package: comprehensive WGD analysis pip install wgd # Step 1: Build paranome (all-vs-all within genome) wgd dmd species.pep.fa # Step 2: Compute Ks values for all paralog pairs wgd ksd species.pep.fa.mcl species.cds.fa # Step 3: Plot Ks distribution wgd viz -d ks_output.tsv -o ks_plots/ # Peaks in the Ks distribution = WGD events # Recent WGD: peak at low Ks (0.1-0.5) # Ancient WGD: peak at high Ks (1.0-3.0) # Background: exponential decay (small-scale duplications) # Step 4: Mixture model fitting # wgd mix: fits Gaussian mixture models to identify WGD peaks wgd mix ks_output.tsv -n 1 2 3 # test 1, 2, 3 components # ksrates: alternative, accounts for rate variation between species # https://github.com/VIB-PSB/ksrates ksrates init --config config.txt ksrates orthologs ksrates paralogs ksrates plot # Synteny-based WGD confirmation: # A genuine WGD produces 2:1 synteny depth # (each chromosome maps to 2 chromosomes in the duplicated genome) python -m jcvi.compara.synteny depth --histogram anchors_file
Claim a WGD only with ≥2 of these lines of evidence converging.
# CAFE5: Computational Analysis of gene Family Evolution # Detects gene families with significant expansion or contraction # Input: gene family count table + dated species tree # gene_counts.tsv: # Desc Family_ID speciesA speciesB speciesC # (null) fam1 3 5 2 # (null) fam2 1 1 4 # Run CAFE5 cafe5 -i gene_counts.tsv -t dated_species_tree.nwk \ -p -k 2 -o cafe_output/ # -p: estimate separate birth/death rates per clade # -k 2: two rate categories (fast families + slow families) # Output analysis: # cafe_output/Base_clade_results.txt — per-clade expansions/contractions # cafe_output/Base_family_results.txt — per-family significance # Significant families: p-value < 0.05 (corrected for multiple testing) # Visualize with cafeplotter (R) # https://github.com/xiangpin/cafeplotter # Or: plot tree with +/- gene counts per branch
# Notung: reconcile gene tree with species tree # Detects: duplications, losses, transfers java -jar Notung.jar -g gene_tree.nwk -s species_tree.nwk \ --rearrange --threshold 90 --parsimony # RANGER-DTL: reconciliation with duplication, transfer, loss ranger-dtl -i input.txt -o output.txt # input.txt: species tree + gene tree in Newick # ALE: Amalgamated Likelihood Estimation # Probabilistic reconciliation — handles uncertainty in gene trees ALEobserve gene_tree.ufboot # creates .ale file from bootstraps ALEml_undated species_tree.nwk gene_tree.ufboot.ale # Output: expected number of D, T, L events per branch
"Duplication, Transfer, Loss — DTL reconciliation"
Gene trees disagree with species trees because of DTL events. Reconciliation algorithms map gene tree nodes to species tree branches and infer the minimum (parsimony) or most probable (likelihood) set of D, T, and L events. ALE handles gene tree uncertainty via bootstrap distributions.
# GRIMM: compute minimum rearrangement distance # (inversions, translocations, fusions, fissions) # Input: synteny blocks as signed permutations # Genome A: 1 2 3 4 5 # Genome B: 1 -3 -2 4 5 (inversion of block 2-3) grimm -f genomes.txt -d # Output: minimum number of rearrangements + scenario # DCJ (Double Cut and Join): more general rearrangement model # Handles inversions, translocations, fusions, fissions equally # UniMoG tool: http://bibiserv.cebitec.uni-bielefeld.de/dcj # Ancestral karyotype reconstruction: # 1. Identify synteny blocks across multiple species # 2. Represent each genome as ordered block permutation # 3. Use MGRA (Multiple Genome Rearrangement and Ancestors) or ANGES # to reconstruct ancestral chromosomal arrangement # 4. Map rearrangements on phylogeny branch by branch
# phastCons: identify conserved elements from multiple alignment # Input: MAF (multiple alignment format) + species tree + neutral model # Step 1: Estimate neutral model from 4-fold degenerate sites phyloFit --tree species_tree.nwk --subst-mod REV \ --msa-format MAF fourfold_sites.maf # Step 2: Run phastCons phastCons --target-coverage 0.3 --expected-length 45 \ --most-conserved conserved_elements.bed \ alignment.maf neutral_model.mod > scores.wig # Step 3: phyloP — per-base conservation/acceleration scores phyloP --method LRT --mode CONACC \ neutral_model.mod alignment.maf > phylop_scores.wig # Interpretation: # phyloP > 0: conservation (purifying selection) # phyloP < 0: acceleration (faster than neutral — possible positive selection) # phyloP near 0: evolving neutrally # Conserved Noncoding Elements (CNEs): # Filter phastCons elements that do NOT overlap annotated exons bedtools intersect -v -a conserved_elements.bed \ -b exons.bed > CNEs.bed # Human Accelerated Regions (HARs): # Regions conserved in mammals but accelerated in human # Pre-computed: download from UCSC or Zoonomia project
Check: (1) Gene IDs in BED files match protein FASTA headers exactly, (2) Genomes are not too divergent (>500 Mya species may have no detectable synteny), (3) BED file is sorted by chromosome and position, (4) Peptide files contain only the longest isoform per gene.
The birth-death rate (lambda) is too high or the tree has very short branches. Solutions: (1) Filter out very large gene families (>100 copies), (2) Check the species tree is ultrametric (branch lengths in time), (3) Ensure gene counts are correct (no paralogs counted multiple times).
This means no detectable WGD. The exponential decay is from small-scale (tandem) duplications + background gene birth-death. This is a valid biological result — not all lineages have WGDs. Don't force-fit Gaussian components to noise.
Cactus is memory-hungry. For large genomes (>1 Gb): (1) Use the Minigraph-Cactus pipeline instead of progressive Cactus, (2) Set --maxCores and --maxMemory appropriately, (3) Use softmasked genomes (mask repeats first with RepeatMasker), (4) Run on a high-memory HPC node (512 GB+).
"Ks Peaks = Polyploidy, Ks Decay = Duplication background"
In a Ks distribution: peaks above the exponential background indicate WGD events. The exponential component is from ongoing small-scale duplications. Multiple peaks = multiple WGDs at different times.
"2:1 depth = WGD, 1:1 = 1R"
Synteny depth tells you the ploidy history. If each region in genome A maps to 2 regions in genome B, genome B had a WGD after diverging from A. 3:1 = two WGDs (hexaploid) or whole genome triplication.