Detect orthologs with OrthoFinder, map synteny blocks with MCScanX, visualize collinearity with dotplots & Circos, and identify whole-genome duplications with Ks analysis.
Comparative genomics compares gene content, order, and structure across species to understand evolution. Key concepts:
| Concept | Definition | Tool |
|---|---|---|
| Orthologs | Genes in different species from a common ancestor (speciation) | OrthoFinder, OrthoMCL |
| Paralogs | Genes duplicated within a species | OrthoFinder, MCScanX |
| Synteny | Conserved gene order between genomes | MCScanX, JCVI, SynMap2 |
| Collinearity | Synteny with preserved orientation | MCScanX, iadhore |
| WGD | Whole-Genome Duplication event | Ks analysis, wgd tool |
"Orthologs = Other species, Paralogs = same Population"
Orthologs diverged by speciation (between species). Paralogs diverged by duplication (within a genome). OrthoFinder finds both.
# Prepare: one protein FASTA per species in a directory mkdir -p orthofinder_input cp species1_proteins.fa orthofinder_input/Species1.fa cp species2_proteins.fa orthofinder_input/Species2.fa cp species3_proteins.fa orthofinder_input/Species3.fa # Run OrthoFinder orthofinder -f orthofinder_input/ -t 16 -a 8 # Key output files: # Orthogroups/Orthogroups.tsv — gene families # Orthogroups/Orthogroups.GeneCount.tsv # Comparative_Genomics/Orthogroups_SpeciesOverlaps.tsv # Species_Tree/SpeciesTree_rooted.txt # Orthologues/Species1__v__Species2.tsv
MCScanX detects collinear blocks of genes between or within genomes.
# Step 1: Prepare input files
# species.gff — simplified gene positions: chr gene start end
awk -F'\t' '$3=="gene" {
match($9, /ID=([^;]+)/, id);
print $1"\t"id[1]"\t"$4"\t"$5
}' species1.gff3 species2.gff3 > combined.gff
# Step 2: All-vs-all BLAST of protein sequences
cat species1_proteins.fa species2_proteins.fa > combined_proteins.fa
makeblastdb -in combined_proteins.fa -dbtype prot
blastp -query combined_proteins.fa -db combined_proteins.fa \
-evalue 1e-10 -outfmt 6 -num_threads 16 \
-max_target_seqs 5 -out combined.blast
# Step 3: Run MCScanX
MCScanX combined
# Outputs: combined.collinearity, combined.html, combined.tandem
# Step 4: Classify duplications
duplicate_gene_classifier combined
# Outputs gene pairs as: WGD/segmental, tandem, proximal, dispersed
.gff file must have exactly 4 columns: chromosome gene_id start end. Gene IDs must match the BLAST output exactly. The prefix of the GFF and BLAST files must match (e.g., combined.gff and combined.blast).
JCVI's MCscan module produces publication-quality synteny dotplots, macro-synteny, and micro-synteny plots.
# Install pip install jcvi # Step 1: Prepare BED and CDS files from GFF python -m jcvi.formats.gff bed --type=mRNA --key=Name species1.gff3 -o species1.bed python -m jcvi.formats.gff bed --type=mRNA --key=Name species2.gff3 -o species2.bed # Step 2: Pairwise synteny search python -m jcvi.compara.catalog ortholog species1 species2 --no_strip_names # Step 3: Dotplot python -m jcvi.graphics.dotplot species1.species2.anchors # Step 4: Macro-synteny (karyotype view) python -m jcvi.compara.synteny screen --minspan=30 \ species1.species2.anchors species1.species2.anchors.new # Create layout file for macro-synteny echo "# y, xstart, xend, rotation, color, label, va, bed .6, .1, .8, 0, , Species 1, top, species1.bed .4, .1, .8, 0, , Species 2, bottom, species2.bed" > layout.csv echo "# anchor file species1.species2.anchors.simple" > seqids python -m jcvi.graphics.karyotype seqids layout.csv # Step 5: Micro-synteny (zoom into a region) python -m jcvi.graphics.synteny species1.species2.anchors.simple \ species1.bed species2.bed --genenames
Dotplots are the most informative way to visualize genome-wide synteny. Each dot = a syntenic gene pair.
"Diagonal = Direct, Parallel = Polyploidy"
A single diagonal means 1:1 synteny. Multiple parallel diagonals mean the genome was duplicated (WGD). Count the diagonals: 2 = tetraploid ancestor, 3 = hexaploid.
Circos creates circular visualizations showing synteny links between chromosomes.
# Using circlize in R (easier than Perl Circos)
library(circlize)
# Read synteny data (from MCScanX collinearity)
synteny <- read.delim("synteny_links.tsv", header=TRUE)
# Columns: chr1, start1, end1, chr2, start2, end2
# Chromosome lengths
karyotype <- read.delim("karyotype.tsv", header=TRUE)
# Columns: chr, start, end
# Initialize circular plot
circos.clear()
circos.par(gap.degree = 2, cell.padding = c(0, 0, 0, 0))
circos.genomicInitialize(karyotype)
# Add chromosome ideogram
circos.track(ylim = c(0, 1), bg.col = rainbow(nrow(karyotype), alpha=0.3),
panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, 0.5, CELL_META$sector.index,
cex = 0.6, facing = "clockwise", niceFacing = TRUE)
})
# Add synteny links
for(i in 1:nrow(synteny)) {
circos.link(synteny$chr1[i], c(synteny$start1[i], synteny$end1[i]),
synteny$chr2[i], c(synteny$start2[i], synteny$end2[i]),
col = adjustcolor("dodgerblue", alpha=0.3), border=NA)
}
Ks (synonymous substitution rate) plots reveal whole-genome duplication events as peaks.
# Using wgd tool pip install wgd # Calculate Ks for paralog pairs wgd dmd proteins.fasta wgd ksd cds.fasta proteins.fasta.mcl # Plot Ks distribution wgd viz -o ks_plot.pdf # Using ParaAT + KaKs_Calculator (alternative) ParaAT.pl -h homologs.txt -n cds.fa -a protein.fa -p proc_num -o paraAT_output KaKs_Calculator -i paraAT_output/aligned.axt -o kaks_results.txt -m YN
Ks peaks = WGD events
Each Ks peak in a paralog Ks distribution = one WGD event. Lower Ks = more recent. Ks > 3 is saturated (too diverged to date). Use at least 1,000 gene pairs for reliable peaks.
Gene IDs in your GFF don't match the BLAST output. Check with: head combined.gff vs head combined.blast. IDs must be identical. Also ensure you have enough BLAST hits (-max_target_seqs 5).
BED files may have wrong gene IDs or CDS file gene names don't match. Use --no_strip_names flag. Also verify BED columns: chr start end gene_id.
Normal for >10 species with >30k genes each. Use -t 32 for more BLAST threads and -a 16 for more analysis threads. Consider DIAMOND mode: -S diamond.
duplicate_gene_classifier is especially useful for plants (distinguishes WGD from tandem)Gold standard ortholog detection + species tree
OrthologsSynteny/collinearity + duplication classification
SyntenyPython synteny pipeline + beautiful plots
SyntenyOriginal MCScan, still used for Ks analysis
WGDWeb-based synteny (CoGe platform)
WebLegacy ortholog clustering (needs MySQL)
OrthologsLightweight synteny chain detection
SyntenyStatistical collinearity detection
SyntenyCircular genome visualizations
VizR-based Circos-style plots
VizKs plots + WGD detection pipeline
WGDR package for pangenome-scale synteny
PangenomeQuick whole-genome alignment dotplots
AlignmentStructural rearrangement identification
Structure"Find Orthologs → Synteny → Duplications → Visualize" — FOSDV
Standard comparative genomics workflow: OrthoFinder → MCScanX → Ks → Circos/dotplot.
-evalue 1e-10 and -max_target_seqs 5. Too stringent = missed synteny. Too loose = noise overwhelms signal.