Comparative Genomics & Synteny

Detect orthologs with OrthoFinder, map synteny blocks with MCScanX, visualize collinearity with dotplots & Circos, and identify whole-genome duplications with Ks analysis.

~80 min Intermediate–Advanced Bash / Python / R

1 Overview

Comparative genomics compares gene content, order, and structure across species to understand evolution. Key concepts:

Concept | Definition | Tool
--- | --- | ---
Orthologs | Genes in different species that descend from a common ancestor through speciation | OrthoFinder, OrthoMCL
Paralogs | Genes related by duplication within a genome | OrthoFinder, MCScanX
Synteny | Conserved gene content between chromosomal regions of different genomes (often used loosely for conserved gene order) | MCScanX, JCVI, SynMap2
Collinearity | Synteny in which gene order is also conserved | MCScanX, i-ADHoRe
WGD | Whole-genome duplication event | Ks analysis, wgd tool

Mnemonic

"Orthologs = Other species, Paralogs = same Population"

Orthologs diverged by speciation (between species). Paralogs diverged by duplication (within a genome). OrthoFinder finds both.

2 OrthoFinder — Ortholog & Gene Family Detection

Bash
# Prepare: one protein FASTA per species in a directory
mkdir -p orthofinder_input
cp species1_proteins.fa orthofinder_input/Species1.fa
cp species2_proteins.fa orthofinder_input/Species2.fa
cp species3_proteins.fa orthofinder_input/Species3.fa

# Run OrthoFinder
orthofinder -f orthofinder_input/ -t 16 -a 8

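# Key output files (under orthofinder_input/OrthoFinder/Results_<date>/):
# Orthogroups/Orthogroups.tsv               gene families (orthogroups)
# Orthogroups/Orthogroups.GeneCount.tsv     genes per species in each orthogroup
# Comparative_Genomics_Statistics/Orthogroups_SpeciesOverlaps.tsv
# Species_Tree/SpeciesTree_rooted.txt
# Orthologues/Orthologues_Species1/Species1__v__Species2.tsv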
OrthoFinder Summary (3 plant species)

Number of species: 3
Number of genes: 87,456
Number of orthogroups: 21,345
Orthogroups with all species: 15,678 (73.5%)
Species-specific orthogroups: 2,345
Single-copy orthologs: 12,456 (useful for phylogenetics!)
Thumb Rule: Single-copy orthologs from OrthoFinder are the gold standard input for species tree construction. Expect 3,000–15,000 for flowering plants.
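
To see how many single-copy orthogroups a run actually produced, the gene-count table can be scanned directly. The sketch below assumes the usual Orthogroups.GeneCount.tsv layout (an Orthogroup column, one column per species, and a Total column); the rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io


def single_copy_orthogroups(handle):
    """Return IDs of orthogroups with exactly one gene in every species."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column except the ID and the row total is treated as a species
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    return [
        row["Orthogroup"]
        for row in reader
        if all(int(row[sp]) == 1 for sp in species)
    ]


# Tiny made-up table in the GeneCount layout (not real data)
demo = io.StringIO(
    "Orthogroup\tSpecies1\tSpecies2\tSpecies3\tTotal\n"
    "OG0000000\t5\t3\t2\t10\n"
    "OG0000001\t1\t1\t1\t3\n"
    "OG0000002\t1\t0\t1\t2\n"
    "OG0000003\t1\t1\t1\t3\n"
)
single_copy = single_copy_orthogroups(demo)
print(single_copy)
```

For a real run, replace the demo handle with the opened Orthogroups.GeneCount.tsv file.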

3 MCScanX — Synteny & Collinearity

MCScanX detects collinear blocks of genes between or within genomes.

Bash
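# Step 1: Prepare input files
# combined.gff is a simplified 4-column table: chr  gene_id  start  end
# (the three-argument match() below is a gawk extension, so call gawk explicitly)
gawk -F'\t' -v OFS='\t' '$3=="gene" {
  if (match($9, /ID=([^;]+)/, id)) print $1, id[1], $4, $5
}' species1.gff3 species2.gff3 > combined.gff
# The IDs in column 2 must be the same IDs used in the protein FASTA headers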

# Step 2: All-vs-all BLAST of protein sequences
cat species1_proteins.fa species2_proteins.fa > combined_proteins.fa
makeblastdb -in combined_proteins.fa -dbtype prot
blastp -query combined_proteins.fa -db combined_proteins.fa \
  -evalue 1e-10 -outfmt 6 -num_threads 16 \
  -max_target_seqs 5 -out combined.blast

# Step 3: Run MCScanX (reads combined.gff and combined.blast)
MCScanX combined
# Outputs: combined.collinearity, combined.tandem, and a combined.html/ directory

# Step 4: Classify duplications
duplicate_gene_classifier combined
# Writes combined.gene_type, labelling each gene as singleton (0), dispersed (1),
# proximal (2), tandem (3), or WGD/segmental (4)
MCScanX collinearity output

## Alignment 0: score=4567 e_value=2.3e-234 N=45 sp1_chr1&sp2_chr3 plus
  0-  0:  sp1_gene001  sp2_gene234  1e-156
  0-  1:  sp1_gene002  sp2_gene235  3e-98
  0-  2:  sp1_gene003  sp2_gene236  2e-145
  ...
  0- 44:  sp1_gene045  sp2_gene278  8e-67
## Alignment 1: score=3456 e_value=1.1e-189 N=38 sp1_chr2&sp2_chr5 minus
  ...
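
For downstream work it helps to load this output into a structure. The parser below is a minimal sketch that assumes the header and pair-line layout shown in the sample; the function name and the dictionary keys are hypothetical, and the demo lines are shortened from the example.

```python
import re

# Matches block headers such as:
# ## Alignment 0: score=4567 e_value=2.3e-234 N=45 sp1_chr1&sp2_chr3 plus
HEADER = re.compile(
    r"## Alignment (\d+): score=(\S+) e_value=(\S+) N=(\d+) (\S+)&(\S+) (plus|minus)"
)


def parse_collinearity(lines):
    """Parse collinearity text into a list of block dictionaries."""
    blocks = []
    for line in lines:
        header = HEADER.match(line)
        if header:
            blocks.append({
                "id": int(header.group(1)),
                "score": float(header.group(2)),
                "chr_a": header.group(5),
                "chr_b": header.group(6),
                "orientation": header.group(7),
                "pairs": [],
            })
        elif blocks and not line.startswith("#") and ":" in line:
            # Pair line: "<block>-<index>:  geneA  geneB  e-value"
            fields = line.split(":", 1)[1].split()
            blocks[-1]["pairs"].append((fields[0], fields[1]))
    return blocks


demo_lines = [
    "# MATCH_SIZE: 5",
    "## Alignment 0: score=4567 e_value=2.3e-234 N=45 sp1_chr1&sp2_chr3 plus",
    "  0-  0:\tsp1_gene001\tsp2_gene234\t  1e-156",
    "  0-  1:\tsp1_gene002\tsp2_gene235\t  3e-98",
    "## Alignment 1: score=3456 e_value=1.1e-189 N=38 sp1_chr2&sp2_chr5 minus",
    "  1-  0:\tsp1_gene500\tsp2_gene900\t  2e-50",
]
blocks = parse_collinearity(demo_lines)
for block in blocks:
    print(block["id"], block["chr_a"], block["chr_b"],
          block["orientation"], len(block["pairs"]))
```

On a real file, pass the open file handle for combined.collinearity instead of demo_lines.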
MCScanX Input Format is Strict: The .gff file must have exactly 4 columns: chromosome gene_id start end. Gene IDs must match the BLAST output exactly. The prefix of the GFF and BLAST files must match (e.g., combined.gff and combined.blast).
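
The three-argument match() in the awk command above is a gawk extension, so a portable alternative can be handy. The sketch below builds the same 4-column rows in Python; the function name is hypothetical, and the optional prefix argument is an assumption for cases where chromosome names need a short species tag to stay unique across genomes.

```python
import re

# ID attribute at the start of column 9 or after a semicolon
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def gff3_to_mcscanx(lines, prefix=""):
    """Turn GFF3 'gene' features into (chr, gene_id, start, end) rows."""
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip GFF3 directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        found = ID_RE.search(fields[8])
        if found:
            rows.append((prefix + fields[0], found.group(1), fields[3], fields[4]))
    return rows


# Made-up GFF3 lines, for illustration only
demo_gff3 = [
    "##gff-version 3",
    "Chr1\tsrc\tgene\t1000\t2500\t.\t+\t.\tID=gene001;Name=abc",
    "Chr1\tsrc\tmRNA\t1000\t2500\t.\t+\t.\tID=gene001.1;Parent=gene001",
    "Chr2\tsrc\tgene\t300\t900\t.\t-\t.\tName=x;ID=gene002",
]
rows = gff3_to_mcscanx(demo_gff3, prefix="sp1_")
for row in rows:
    print("\t".join(row))
```

Run it once per species and concatenate the rows into combined.gff.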

4 JCVI / MCscan (Python) — Beautiful Synteny Plots

JCVI's MCscan module produces publication-quality synteny dotplots, macro-synteny, and micro-synteny plots.

Bash
# Install
pip install jcvi

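# Step 1: Prepare BED files from GFF3. Step 2 also expects matching CDS FASTA
# files named species1.cds and species2.cds in the same directory.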
python -m jcvi.formats.gff bed --type=mRNA --key=Name species1.gff3 -o species1.bed
python -m jcvi.formats.gff bed --type=mRNA --key=Name species2.gff3 -o species2.bed

# Step 2: Pairwise synteny search
python -m jcvi.compara.catalog ortholog species1 species2 --no_strip_names

# Step 3: Dotplot
python -m jcvi.graphics.dotplot species1.species2.anchors

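# Step 4: Macro-synteny (karyotype view)
# --simple writes the block summary species1.species2.anchors.simple used below
python -m jcvi.compara.synteny screen --minspan=30 --simple \
  species1.species2.anchors species1.species2.anchors.new

# Layout file: one track per genome, then the edges that link the tracks
echo "# y, xstart, xend, rotation, color, label, va, bed
.6, .1, .8, 0, , Species 1, top, species1.bed
.4, .1, .8, 0, , Species 2, bottom, species2.bed
# edges
e, 0, 1, species1.species2.anchors.simple" > layout.csv

# seqids file: comma-separated chromosome IDs to draw, one line per genome
# (reorder or trim these lines by hand as needed)
cut -f1 species1.bed | sort -u | paste -sd, - > seqids
cut -f1 species2.bed | sort -u | paste -sd, - >> seqids

python -m jcvi.graphics.karyotype seqids layout.csv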

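# Step 5: Micro-synteny (zoom into a region)
# Build a gene-level blocks table, keep the rows for the region of interest
# (the first 50 rows here are only a placeholder), and plot them.
python -m jcvi.compara.synteny mcscan species1.bed \
  species1.species2.lifted.anchors --iter=1 -o species1.species2.i1.blocks
head -50 species1.species2.i1.blocks > blocks
cat species1.bed species2.bed > species1_species2.bed
# blocks.layout is a small hand-written file (x, y, rotation, ha, va, color,
# ratio, label per track, then an "e, 0, 1" edge line); see the JCVI MCscan wiki
python -m jcvi.graphics.synteny blocks species1_species2.bed blocks.layout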
Thumb Rule: JCVI produces the prettiest synteny figures for papers. MCScanX is better for downstream statistical analysis (Ks, duplication classification). Use both.

5 Reading Dotplots

Dotplots are the most informative way to visualize genome-wide synteny. Each dot = a syntenic gene pair.

How to Read a Dotplot
  • Diagonal line = collinear synteny (conserved gene order)
  • Off-diagonal lines = translocations or duplications
  • Parallel diagonals = whole-genome duplication (WGD)
  • Reversed diagonal = chromosomal inversion
  • Scattered dots = dispersed duplications or noise
Mnemonic

"Diagonal = Direct, Parallel = Polyploidy"

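A single diagonal means 1:1 synteny. Several parallel diagonals, where one region matches two or more regions of the other genome (or of the same genome in a self-comparison), point to a whole-genome duplication (WGD). Count the matching regions: 2 suggests a tetraploid ancestor, 3 a hexaploid ancestor.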
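
The same reading rules can be applied to anchor pairs without drawing anything. The sketch below is an illustration only: it assumes each anchor is given as a pair of gene-order ranks (position along genome A, position along genome B), and both function names are hypothetical.

```python
def block_orientation(pairs):
    """Classify a block from (rank_a, rank_b) gene-order positions.

    A forward block runs along a rising diagonal in a dotplot,
    an inverted block along a falling one.
    """
    ordered = sorted(pairs)  # walk along genome A
    steps = [b2 - b1 for (_, b1), (_, b2) in zip(ordered, ordered[1:])]
    rising = sum(step > 0 for step in steps)
    falling = sum(step < 0 for step in steps)
    if rising > falling:
        return "forward"
    if falling > rising:
        return "inverted"
    return "ambiguous"


def syntenic_depth(block_spans, position):
    """Count blocks whose span on genome A covers a given gene rank."""
    return sum(start <= position <= end for start, end in block_spans)


# Made-up gene ranks, for illustration only
forward_block = [(1, 10), (2, 11), (3, 13), (4, 14)]
inverted_block = [(20, 95), (21, 94), (22, 92)]
print(block_orientation(forward_block))
print(block_orientation(inverted_block))

# Two blocks cover rank 25 on genome A, one covers rank 5
spans_on_a = [(1, 30), (20, 60)]
print(syntenic_depth(spans_on_a, 25), syntenic_depth(spans_on_a, 5))
```

A depth of 2 across most of a genome corresponds to the "two parallel diagonals" pattern described above.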

6 Circos Plots

Circos creates circular visualizations showing synteny links between chromosomes.

R
# Using circlize in R (easier than Perl Circos)
library(circlize)

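# Read synteny links: one row per collinear block, with coordinates looked up
# from the gene positions (MCScanX's .collinearity file lists gene pairs only)
synteny <- read.delim("synteny_links.tsv", header=TRUE)
# Columns: chr1, start1, end1, chr2, start2, end2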

# Chromosome lengths
karyotype <- read.delim("karyotype.tsv", header=TRUE)
# Columns: chr, start, end

# Initialize circular plot
circos.clear()
circos.par(gap.degree = 2, cell.padding = c(0, 0, 0, 0))
circos.genomicInitialize(karyotype)

# Add chromosome ideogram
circos.track(ylim = c(0, 1), bg.col = rainbow(nrow(karyotype), alpha=0.3),
             panel.fun = function(x, y) {
               circos.text(CELL_META$xcenter, 0.5, CELL_META$sector.index,
                           cex = 0.6, facing = "clockwise", niceFacing = TRUE)
             })

# Add synteny links (one ribbon per collinear block)
for (i in seq_len(nrow(synteny))) {
  circos.link(synteny$chr1[i], c(synteny$start1[i], synteny$end1[i]),
              synteny$chr2[i], c(synteny$start2[i], synteny$end2[i]),
              col = adjustcolor("dodgerblue", alpha.f = 0.3), border = NA)
}
circos.clear()  # reset circlize state once the plot is finished
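
The synteny_links.tsv file read by the R code has to be built from the collinear blocks and the gene coordinates. The following is one possible sketch, assuming each block is already available as a list of gene pairs and gene positions come from the 4-column gff; the function name and the coordinates are hypothetical.

```python
import csv
import io


def block_to_link(pairs, gene_pos):
    """Collapse one collinear block into a single link row.

    pairs    : [(gene_a, gene_b), ...] from one alignment block
    gene_pos : {gene_id: (chrom, start, end)} from the 4-column gff
    """
    side_a = [gene_pos[a] for a, _ in pairs]
    side_b = [gene_pos[b] for _, b in pairs]
    return (
        side_a[0][0], min(s for _, s, _ in side_a), max(e for _, _, e in side_a),
        side_b[0][0], min(s for _, s, _ in side_b), max(e for _, _, e in side_b),
    )


# Made-up coordinates, for illustration only
gene_pos = {
    "sp1_gene001": ("sp1_chr1", 1000, 2500),
    "sp1_gene002": ("sp1_chr1", 4000, 5200),
    "sp2_gene234": ("sp2_chr3", 90000, 91500),
    "sp2_gene235": ("sp2_chr3", 93000, 94800),
}
block = [("sp1_gene001", "sp2_gene234"), ("sp1_gene002", "sp2_gene235")]
link = block_to_link(block, gene_pos)

out = io.StringIO()  # swap for open("synteny_links.tsv", "w", newline="")
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["chr1", "start1", "end1", "chr2", "start2", "end2"])
writer.writerow(link)
print(out.getvalue(), end="")
```

Each block becomes one ribbon spanning from its first to its last gene on both chromosomes.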

7 Ks Analysis & WGD Detection

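Ks (the number of synonymous substitutions per synonymous site) measures how far two gene copies have diverged. A whole-genome duplication creates many paralog pairs of the same age, so it shows up as a peak in the Ks distribution.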

Bash
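# Using the wgd tool (commands follow wgd v2; check `wgd --help` for your version)
pip install wgd

# Build paralogous gene families from the CDS FASTA
wgd dmd cds.fasta
# Calculate Ks for the paralog pairs in each family
wgd ksd wgd_dmd/cds.fasta.tsv cds.fasta

# Plot the Ks distribution
wgd viz -d wgd_ksd/cds.fasta.tsv.ks.tsv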

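# Using ParaAT + KaKs_Calculator (alternative)
# proc_num is a text file holding the number of threads; -f axt writes one AXT alignment per pair
ParaAT.pl -h homologs.txt -n cds.fa -a protein.fa -p proc_num -m mafft -f axt -o paraAT_output
cat paraAT_output/*.axt > aligned.axt
KaKs_Calculator -i aligned.axt -o kaks_results.txt -m YN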
Interpreting Ks Peaks

  • Ks peak ~0.05-0.2 = recent tandem / segmental duplication
  • Ks peak ~0.3-0.8 = recent WGD (e.g., within genus)
  • Ks peak ~1.0-2.0 = ancient WGD (e.g., shared across family)
  • Ks > 3.0 = saturated, unreliable

Example: Arabidopsis shows Ks peaks at ~0.8 (alpha WGD) and ~2.0 (beta WGD, shared with other Brassicales)
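
A rough way to locate such peaks is to bin the Ks values and look for local maxima. The sketch below is a deliberately simple stand-in (a plain histogram, not the kernel density or mixture-model fits that dedicated tools use); the function name, bin width, and minimum count are assumptions, and the Ks values are made up.

```python
def ks_histogram_peaks(ks_values, bin_width=0.1, max_ks=3.0, min_count=5):
    """Return bin centres that are local maxima of a Ks histogram."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:  # drop saturated values
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    peaks = []
    for i in range(1, n_bins - 1):
        # A peak is a well-populated bin that is higher than both neighbours
        if counts[i] >= min_count and counts[i - 1] < counts[i] > counts[i + 1]:
            peaks.append(round((i + 0.5) * bin_width, 2))
    return peaks


# Made-up Ks values with two clusters and a few saturated pairs
demo_ks = (
    [0.75] * 5 + [0.85] * 12 + [0.95] * 6
    + [1.95] * 4 + [2.05] * 9 + [2.15] * 5
    + [3.5] * 3
)
peaks = ks_histogram_peaks(demo_ks)
print(peaks)
```

Real distributions are noisier, so smoothing or a larger bin width is usually needed before peak calling.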
Thumb Rule

Ks peaks = WGD events

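In a paralog Ks distribution, each distinct peak above the background of small-scale duplicates suggests one WGD event; confirm it with collinear blocks before calling it. Lower Ks = more recent. Ks > 3 is saturated (too diverged to date). Use at least 1,000 gene pairs for reliable peaks.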
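
A Ks peak can be turned into an approximate age with the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below applies that formula; the function name is hypothetical and the rate used is an illustrative assumption only, since r differs between lineages and must come from the literature for your group.

```python
def ks_to_age_mya(ks, rate_per_site_per_year):
    """Convert a Ks value to divergence time in million years (T = Ks / 2r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    # Factor 2: both gene copies accumulate substitutions independently
    return ks / (2.0 * rate_per_site_per_year) / 1e6


# Illustrative rate only (assumption, not a recommended value)
ASSUMED_RATE = 1e-8
for peak in (0.8, 2.0):
    print(f"Ks {peak}: about {ks_to_age_mya(peak, ASSUMED_RATE):.0f} Mya")
```

Because the estimate scales inversely with r, halving the assumed rate doubles every date, so report the rate alongside any age.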

8 AI Prompt Guide

Comparative Genomics for Your Species
I want to compare [NUMBER] genomes: [LIST SPECIES]. I have GFF3 annotation and protein FASTA for each. My goals are: [synteny analysis / ortholog detection / WGD dating / all].

Please write scripts that:
  1. Run OrthoFinder on all species
  2. Run MCScanX for pairwise synteny between [SPECIES PAIR]
  3. Generate a dotplot and macro-synteny karyotype view with JCVI
  4. Calculate Ks for paralog and ortholog pairs
  5. Create a Circos plot showing synteny links
  6. Identify WGD events from Ks peaks

My organism group is [PLANTS / ANIMALS / FUNGI].

9 Common Errors

MCScanX: "0 syntenic blocks found"

Gene IDs in your GFF don't match the BLAST output. Check with: head combined.gff vs head combined.blast. IDs must be identical. Also ensure you have enough BLAST hits (-max_target_seqs 5).
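
Comparing the two files by eye with head misses mismatches further down, so a full ID comparison is more reliable. The check below is a minimal sketch with a hypothetical function name; it assumes the 4-column gff and tabular BLAST (-outfmt 6) layouts described above, and the demo lines are made up.

```python
def check_mcscanx_inputs(gff_lines, blast_lines):
    """Report malformed gff rows and IDs present in only one of the files."""
    gff_ids, malformed = set(), []
    for number, line in enumerate(gff_lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            malformed.append(number)  # MCScanX expects exactly 4 columns
            continue
        gff_ids.add(fields[1])
    blast_ids = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            blast_ids.update(fields[:2])  # query ID and subject ID
    return {
        "malformed_gff_rows": malformed,
        "in_blast_not_gff": sorted(blast_ids - gff_ids),
        "in_gff_not_blast": sorted(gff_ids - blast_ids),
    }


# Made-up example: the BLAST file uses transcript-style IDs for one gene
demo_gff = [
    "sp1_chr1\tsp1_gene001\t1000\t2500",
    "sp1_chr1\tsp1_gene002\t4000\t5200",
    "sp2_chr3\tsp2_gene234\t90000",
]
demo_blast = [
    "sp1_gene001\tsp1_gene002.1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-156\t520",
]
report = check_mcscanx_inputs(demo_gff, demo_blast)
print(report)
```

A non-empty in_blast_not_gff list is the usual cause of empty MCScanX output; trimming or mapping the transcript suffix in one of the files typically resolves it.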

JCVI: "No anchors found"

BED files may have wrong gene IDs or CDS file gene names don't match. Use --no_strip_names flag. Also verify BED columns: chr start end gene_id.

OrthoFinder: Runs for days

Normal for >10 species with >30k genes each. Use -t 32 for more sequence-search threads and -a 16 for more analysis threads. OrthoFinder 2 already uses DIAMOND by default; if you switched to -S blast, go back to -S diamond.

10 Organism-Specific Notes

Plants
  • Most flowering plants have undergone 1–3 WGDs — expect multiple Ks peaks
  • Polyploids (wheat, sugarcane) have complex synteny patterns — use subgenome-aware tools
  • MCScanX's duplicate_gene_classifier is especially useful for plants (distinguishes WGD from tandem)
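
To compare duplication modes at a glance, the classifier output can be tallied per class. This sketch assumes a two-column gene_type file (gene ID, numeric code) with codes 0 to 4 for singleton, dispersed, proximal, tandem, and WGD/segmental; the function name is hypothetical and the rows are made up.

```python
from collections import Counter

# Assumed meaning of the numeric codes in the gene_type file
GENE_TYPE_LABELS = {
    "0": "singleton",
    "1": "dispersed",
    "2": "proximal",
    "3": "tandem",
    "4": "WGD/segmental",
}


def tally_gene_types(lines):
    """Count genes per duplication class from gene_type rows."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[GENE_TYPE_LABELS.get(fields[1], "unknown")] += 1
    return counts


# Made-up rows, for illustration only
demo_gene_type = [
    "sp1_gene001\t4",
    "sp1_gene002\t4",
    "sp1_gene003\t3",
    "sp1_gene004\t0",
    "sp1_gene005\t1",
]
counts = tally_gene_types(demo_gene_type)
for label, n in counts.most_common():
    print(f"{label}\t{n}")
```

A genome dominated by the WGD/segmental class is consistent with a polyploid history, while a large tandem fraction points to local expansion.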
Animals
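  • Vertebrates share 2 ancient WGDs (1R, 2R); these are too old to resolve by Ks (synonymous sites are saturated) and are usually inferred from conserved paralogous blocks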
  • Insects have highly rearranged genomes — synteny blocks are shorter
  • Use Genomicus or Ensembl Compara for pre-computed synteny of model organisms
Microbes & Fungi
  • Horizontal gene transfer complicates ortholog detection
  • Fungi: use DAGchainer or MCScanX for smaller genomes
  • Bacteria: synteny conservation varies wildly even within species

11 All Comparative Genomics Tools

Tool | Description | Category
--- | --- | ---
OrthoFinder | Gold standard ortholog detection + species tree | Orthologs
MCScanX | Synteny/collinearity + duplication classification | Synteny
JCVI / MCscan | Python synteny pipeline + beautiful plots | Synteny
MCScan (Tang) | Original MCScan, still used for Ks analysis | WGD
SynMap2 | Web-based synteny (CoGe platform) | Web
OrthoMCL | Legacy ortholog clustering (needs MySQL) | Orthologs
DAGchainer | Lightweight synteny chain detection | Synteny
i-ADHoRe | Statistical collinearity detection | Synteny
Circos (Perl) | Circular genome visualizations | Viz
circlize (R) | R-based Circos-style plots | Viz
wgd | Ks plots + WGD detection pipeline | WGD
GENESPACE | R package for pangenome-scale synteny | Pangenome
Minimap2 + paftools | Quick whole-genome alignment dotplots | Alignment
SyRI | Structural rearrangement identification | Structure

Mnemonics & Thumb Rules

Mnemonic

"Find Orthologs → Synteny → Duplications → Visualize" — FOSDV

Standard comparative genomics workflow: OrthoFinder → MCScanX → Ks → Circos/dotplot.

Thumb Rule — Minimum for Synteny: You need ≥5 collinear gene pairs in a block for it to be statistically meaningful. MCScanX default is 5. Don't lower it below 3 or you'll get false positives.
Thumb Rule — BLAST for MCScanX: Use -evalue 1e-10 and -max_target_seqs 5. Too stringent = missed synteny. Too loose = noise overwhelms signal.

Summary

You can now:
  1. Detect orthologs and gene families with OrthoFinder
  2. Map synteny and collinearity with MCScanX
  3. Create publication-quality dotplots and karyotype views with JCVI
  4. Build Circos plots to show genome-wide synteny
  5. Date WGD events using Ks distributions