Identify orthologous gene families across nematode genomes with OrthoFinder and Opscan, then detect conserved synteny blocks with i-ADHoRe — mapping the evolutionary history of genome architecture.
Comparative genomics reveals the evolutionary forces shaping genomes. Synteny analysis identifies chromosomal regions that retain the same gene order across species — evidence for shared ancestry. This tutorial covers two complementary approaches:
All-vs-all protein BLAST to identify orthologous gene families across multiple genomes. Output: orthogroups, orthologs per species pair, gene duplication history.
Detects co-linear (syntenic) blocks from ortholog gene lists. Identifies whole-genome duplications (WGD), segmental duplications, and conserved chromosomal regions across species.
Comparative genomics projects involve large intermediate files from BLAST all-vs-all. Keep each tool's outputs in numbered directories so you can re-run any step independently.
-S diamond (10× faster than BLAST) or -S diamond_ultra_sens for maximum sensitivity. Store Results_*/WorkingDirectory/ in case you need to resume a failed run.| Term | Definition | Evolutionary event |
|---|---|---|
| Orthologs | Same gene in different species, related by speciation | Speciation (vertical inheritance) |
| Paralogs | Duplicated copies of a gene within the same or different species | Gene duplication |
| Xenologs | Genes related by horizontal gene transfer | HGT |
| Orthogroup | Group of genes descended from a single gene in the last common ancestor | Includes orthologs and paralogs |
Three plant-parasitic nematode genomes from WormBase ParaSite — all well-annotated, publicly available:
| Species | Common name | Genome size | Source |
|---|---|---|---|
| Heterodera glycines | Soybean cyst nematode | 93 Mb | WormBase ParaSite |
| Globodera pallida | Potato cyst nematode | 124 Mb | WormBase ParaSite |
| Meloidogyne hapla | Root-knot nematode | 54 Mb | WormBase ParaSite |
conda install -c bioconda orthofinder diamond
# or: module load orthofinder/2.5.2
mkdir -p proteins
# Download protein FASTA from WormBase ParaSite for each species
wget "https://parasite.wormbase.org/ftp.html" -q # see FTP directory for paths
# Example paths (actual URL structure):
# ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/current/species/
# heterodera_glycines/PRJNA381081/heterodera_glycines.PRJNA381081.WBPS18.protein.fa.gz
# globodera_pallida/PRJEB123/globodera_pallida.PRJEB123.WBPS18.protein.fa.gz
# meloidogyne_hapla/PRJNA29083/meloidogyne_hapla.PRJNA29083.WBPS18.protein.fa.gz
# Decompress and rename FASTA files
# OrthoFinder uses filename as species name
gzip -d *.gz
mv heterodera_glycines.*.protein.fa proteins/H_glycines.fa
mv globodera_pallida.*.protein.fa proteins/G_pallida.fa
mv meloidogyne_hapla.*.protein.fa proteins/M_hapla.fa
# Verify
for f in proteins/*.fa; do echo "$f: $(grep -c '>' $f) proteins"; done
# Run OrthoFinder
# -f = directory of protein FASTA files (one per species)
# -t = threads for BLAST
# -a = threads for OrthoFinder analysis
# -M msa = use multiple sequence alignment (more accurate)
orthofinder \
-f proteins/ \
-t 16 \
-a 8 \
-S diamond \
-o orthofinder_results/
# Key output files:
ls orthofinder_results/Results_*/
# Orthogroups/Orthogroups.tsv -- main orthogroup table
# Orthogroups/Orthogroups_SingleCopyOrthologues.txt
# Phylogenetically_Misplaced_Genes/
# Orthogroup_Sequences/ -- FASTA per orthogroup
| 📁-f proteins/ | Input folder containing one protein FASTA per species. OrthoFinder uses the filename as the species name in all outputs — choose meaningful filenames like H_glycines.fa rather than the original download names. |
| 🧵-t 16 / -a 8 | -t = threads for the all-vs-all protein comparison step (DIAMOND). -a = threads for the OrthoFinder analysis step (MCL clustering, gene tree inference). The DIAMOND step is the bottleneck, so give it more threads. |
| ⚡-S diamond | Use DIAMOND instead of BLAST for the all-vs-all protein comparison. ~10× faster with comparable accuracy for orthogroup detection. Use -S blast if you need maximum sensitivity for highly diverged sequences. |
| 📈Orthogroups.tsv | The main output: a tab-separated matrix of orthogroup ID vs each species, listing all gene members. Also see Orthogroups_SingleCopyOrthologues.txt for 1:1 orthologs — ideal for phylogenomics. |
Opscan takes protein FASTA and GFF files to build gene-order lists needed for synteny detection:
mkdir -p opscan_input
# Opscan expects FASTA with headers in specific format:
# >species_geneName scaffold_name start stop strand
# Example: >Hgly_mRNA1 scaffold_1 100 500 +
# Use bioawk to reformat headers from GFF + FASTA
# This extracts protein ID, scaffold, coordinates, and strand from GFF
module load bioawk
GFF="H_glycines.gff3"
PROT="proteins/H_glycines.fa"
# Extract gene positions from GFF
grep $'\tmRNA\t' ${GFF} | awk '{
split($9, a, ";")
for (i in a) {
if (a[i] ~ /^ID=/) {
gsub("ID=", "", a[i])
id = a[i]
}
}
print id, $1, $4, $5, $7
}' OFS='\t' > Hgly_gene_positions.tsv
# Reformat protein FASTA with positional headers
python3 << 'EOF'
import pandas as pd
from Bio import SeqIO
positions = pd.read_csv("Hgly_gene_positions.tsv", sep="\t",
names=["id","scaffold","start","stop","strand"])
pos_dict = positions.set_index("id").to_dict("index")
with open("opscan_input/H_glycines.fasta", "w") as out:
for rec in SeqIO.parse("proteins/H_glycines.fa", "fasta"):
gid = rec.id.split(".")[0] # strip isoform suffix
if gid in pos_dict:
p = pos_dict[gid]
header = f"Hgly_{gid} {p['scaffold']} {p['start']} {p['stop']} {p['strand']}"
out.write(f">{header}\n{str(rec.seq)}\n")
EOF
echo "Opscan input prepared: opscan_input/H_glycines.fasta"
head -4 opscan_input/H_glycines.fasta
# Download Opscan (free, from iADHoRe supplementary)
wget http://bioinformatics.psb.ugent.be/downloads/psb/iADHoRe/opscan.tar.gz
tar xzf opscan.tar.gz
cd opscan/ && make && cd ..
OPSCAN="opscan/opscan"
INDIR="opscan_input"
OUTDIR="opscan_results"
mkdir -p ${OUTDIR}
# Run all pairwise combinations
for sp1 in H_glycines G_pallida M_hapla; do
for sp2 in H_glycines G_pallida M_hapla; do
if [ "${sp1}" != "${sp2}" ]; then
PAIR="${sp1}_vs_${sp2}"
echo "Running Opscan: ${PAIR}"
${OPSCAN} \
${INDIR}/${sp1}.fasta \
${INDIR}/${sp2}.fasta \
> ${OUTDIR}/${PAIR}.orthologs.txt
echo " Ortholog pairs: $(wc -l < ${OUTDIR}/${PAIR}.orthologs.txt)"
fi
done
done
# Opscan output format (tab-separated):
# gene1_sp1 gene1_sp2 bitscore
# Hgly_mRNA1 Gpall_mRNA42 287
head -5 ${OUTDIR}/H_glycines_vs_G_pallida.orthologs.txt
| 🏷️FASTA header format | Opscan requires headers in >prefix_geneName scaffold start stop strand format. The prefix_ (e.g. Hgly_) identifies which species the gene belongs to in the output. Positional info allows Opscan to build the gene-order map for synteny detection. |
| 🔄All pairwise combinations | Opscan runs pairwise (not multi-species at once). For N species you need N×(N-1) runs. For 3 species: 6 pairs. The nested loop handles this automatically. Each run produces a bidirectional ortholog list. |
| 📊Output: gene1 gene2 bitscore | Each line is an ortholog pair with BLAST bitscore as confidence metric. Higher bitscore = better match. These pairs feed directly into i-ADHoRe to find collinear runs. |
# Run RepeatMasker to identify repetitive elements
module load repeatmasker
RepeatMasker \
-pa 16 \
-species nematoda \
-gff \
-dir repeatmasker_out/ \
genome_assembly.fasta
# Get coordinates of repeat-overlapping genes from RepeatMasker GFF
REPEAT_GFF="repeatmasker_out/genome_assembly.fasta.out.gff"
# Use bedtools to identify genes overlapping repeats
module load bedtools
# Convert gene GFF to BED
grep $'\tmRNA\t' H_glycines.gff3 | \
awk '{print $1, $4-1, $5, $9, ".", $7}' OFS='\t' \
> genes.bed
# Get repeat coordinates as BED
awk '/^[^#]/ {print $1, $4-1, $5}' OFS='\t' ${REPEAT_GFF} > repeats.bed
# Find genes with >50% overlap with repeats
bedtools coverage \
-a genes.bed \
-b repeats.bed \
| awk '$NF > 0.50 {print $4}' \
| sed 's/ID=//' \
> repeat_gene_ids.txt
echo "Genes overlapping repeats (>50%): $(wc -l < repeat_gene_ids.txt)"
# Remove these from Opscan outputs
for ortho in opscan_results/*.orthologs.txt; do
OUT="${ortho%.orthologs.txt}.filtered.txt"
grep -v -f repeat_gene_ids.txt ${ortho} > ${OUT}
echo "Filtered: $(wc -l < ${ortho}) -> $(wc -l < ${OUT}) pairs"
done
mkdir -p iadhore_input iadhore_output
# For each genome, create per-scaffold gene list files
# i-ADHoRe needs: gene order files, one file per scaffold
# Format: gene_name orientation (+ or -)
for SPECIES in H_glycines G_pallida M_hapla; do
SPECIES_DIR="iadhore_input/${SPECIES}"
mkdir -p ${SPECIES_DIR}
awk -v sp="${SPECIES}" '
/^\t/ { next } # skip header
{
print $2, $5 > (SPECIES_DIR "/" $1 ".lst")
# Creates one file per scaffold with gene names and strands
}' gene_positions_${SPECIES}.tsv
done
# Write i-ADHoRe configuration file
cat > iadhore.ini << 'EOF'
# i-ADHoRe 3.0 configuration file
# Genome 1: H. glycines
genome= H_glycines
H_glycines/scaffold_1.lst
H_glycines/scaffold_2.lst
# ... (list all scaffold files)
# Genome 2: G. pallida
genome= G_pallida
G_pallida/scaffold_1.lst
G_pallida/scaffold_2.lst
# Ortholog table (from Opscan output)
blast_table= opscan_results/H_glycines_vs_G_pallida.filtered.txt
# Output directory
output_path= iadhore_output/
# Parameters
gap_size= 30 # max gap between syntenic genes
cluster_gap= 35 # gap for merging clusters into blocks
q_value= 0.75 # FDR threshold
anchor_points= 3 # minimum collinear gene pairs
alignment_method= gg2 # greedy graph
max_gaps_in_alignment= 40
EOF
# Run i-ADHoRe
iadhore iadhore.ini
echo "i-ADHoRe complete. Synteny blocks:"
wc -l iadhore_output/anchorpoints.txt
echo "Output files:"
ls iadhore_output/
library(ggplot2)
library(dplyr)
# Load i-ADHoRe anchor points (syntenic gene pairs)
anchors <- read.table("iadhore_output/anchorpoints.txt",
header=TRUE, sep="\t")
head(anchors)
# Get gene positions for dot plot
positions_sp1 <- read.table("gene_positions_H_glycines.tsv",
col.names=c("id","scaffold","start","stop","strand"))
positions_sp2 <- read.table("gene_positions_G_pallida.tsv",
col.names=c("id","scaffold","start","stop","strand"))
# Join anchor points with positions
plot_data <- anchors %>%
left_join(positions_sp1 %>% select(id, scaffold, start),
by=c("gene_x"="id")) %>%
rename(scaffold_x=scaffold, pos_x=start) %>%
left_join(positions_sp2 %>% select(id, scaffold, start),
by=c("gene_y"="id")) %>%
rename(scaffold_y=scaffold, pos_y=start)
# Dot plot: each point is a syntenic gene pair
ggplot(plot_data, aes(x=pos_x/1e6, y=pos_y/1e6, color=scaffold_x)) +
geom_point(size=0.5, alpha=0.7) +
facet_grid(scaffold_y ~ scaffold_x, scales="free") +
labs(x="H. glycines position (Mb)",
y="G. pallida position (Mb)",
title="Synteny: H. glycines vs G. pallida") +
theme_minimal() +
theme(legend.position="none",
strip.text=element_text(size=6))
ggsave("synteny_dotplot.png", width=14, height=12, dpi=150)
cat("Synteny dot plot saved: synteny_dotplot.png\n")
This almost always indicates transposable element proteins (transposases, integrases, reverse transcriptases). Mask repeats before running OrthoFinder. Alternatively, filter orthogroups with unexpectedly high copy numbers (e.g., >3× the median gene copy per species).
Check: (1) Gene list files must be in iadhore_input/SPECIES/ and the INI file must list them correctly — even a typo in a path will silently produce no output. (2) Opscan ortholog file column order must match i-ADHoRe expectations (gene_sp1 TAB gene_sp2). (3) Try lowering anchor_points to 2 for distantly related species.
Opscan is strict about the FASTA header format. Each protein header must have exactly 5 space-separated fields: species_geneName scaffold start stop strand. Scaffold names cannot contain spaces or special characters. Test with 10 proteins first: head -20 input.fasta | ./opscan /dev/stdin input.fasta.