Gene Orthology & Synteny Analysis

Overview

Comparative genomics reveals the evolutionary forces shaping genomes. Synteny analysis identifies chromosomal regions that retain the same gene order across species — evidence for shared ancestry. This tutorial covers two complementary approaches:

OrthoFinder / Opscan

All-vs-all protein BLAST to identify orthologous gene families across multiple genomes. Output: orthogroups, orthologs per species pair, gene duplication history.

i-ADHoRe

Detects co-linear (syntenic) blocks from ortholog gene lists. Identifies whole-genome duplications (WGD), segmental duplications, and conserved chromosomal regions across species.

Ideal Directory Structure

Comparative genomics projects involve large intermediate files from BLAST all-vs-all. Keep each tool's outputs in numbered directories so you can re-run any step independently.

comparative_genomics_nematodes/

comparative_genomics_nematodes/ ├── proteins/ # one protein FASTA per species (OrthoFinder input) │ ├── H_glycines.fa │ ├── G_pallida.fa │ └── M_hapla.fa ├── 1_orthofinder/ # OrthoFinder all-vs-all BLAST + orthogroups │ └── Orthogroups.tsv # main output: orthogroups × species matrix ├── 2_opscan/ # Opscan pairwise ortholog detection │ ├── input/ # positionally-formatted FASTA files │ └── Hgly_vs_Gpall.orthologs.txt ├── 3_repeatmasker/ # repeat masking for repeat gene filtering ├── 4_iadhore/ # i-ADHoRe synteny detection │ ├── iadhore.ini # configuration file │ ├── input/ # per-scaffold gene order lists │ └── anchorpoints.txt # collinear gene pairs (key output) └── 5_visualization/ # synteny plots (dotplots, chord diagrams) └── synteny_dotplot.png

⏱️

Runtime reality check: OrthoFinder DIAMOND all-vs-all for 3 nematode genomes (~20,000 proteins each) takes ~2–6 hours on 16 threads. The BLAST phase is the bottleneck. Use -S diamond (10× faster than BLAST) or -S diamond_ultra_sens for maximum sensitivity. Store Results_*/WorkingDirectory/ in case you need to resume a failed run.

Orthologs vs Paralogs vs Xenologs

Term	Definition	Evolutionary event
Orthologs	Same gene in different species, related by speciation	Speciation (vertical inheritance)
Paralogs	Duplicated copies of a gene within the same or different species	Gene duplication
Xenologs	Genes related by horizontal gene transfer	HGT
Orthogroup	Group of genes descended from a single gene in the last common ancestor	Includes orthologs and paralogs

Why filter repetitive elements? Transposable elements and other repeats encode proteins (transposases, reverse transcriptases) that BLAST readily against each other. If included, they create enormous spurious orthogroups swamping the real signal. Always soft-mask or hard-mask repeats before running OrthoFinder or Opscan.

Dataset

Three plant-parasitic nematode genomes from WormBase ParaSite — all well-annotated, publicly available:

Species	Common name	Genome size	Source
Heterodera glycines	Soybean cyst nematode	93 Mb	WormBase ParaSite
Globodera pallida	Potato cyst nematode	124 Mb	WormBase ParaSite
Meloidogyne hapla	Root-knot nematode	54 Mb	WormBase ParaSite

Step 1 — OrthoFinder: All-vs-All Ortholog Detection

Bash — OrthoFinder setup and run

conda install -c bioconda orthofinder diamond
# or: module load orthofinder/2.5.2

mkdir -p proteins

# Download protein FASTA from WormBase ParaSite for each species
wget "https://parasite.wormbase.org/ftp.html" -q  # see FTP directory for paths
# Example paths (actual URL structure):
# ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/current/species/
#   heterodera_glycines/PRJNA381081/heterodera_glycines.PRJNA381081.WBPS18.protein.fa.gz
#   globodera_pallida/PRJEB123/globodera_pallida.PRJEB123.WBPS18.protein.fa.gz
#   meloidogyne_hapla/PRJNA29083/meloidogyne_hapla.PRJNA29083.WBPS18.protein.fa.gz

# Decompress and rename FASTA files
# OrthoFinder uses filename as species name
gzip -d *.gz
mv heterodera_glycines.*.protein.fa proteins/H_glycines.fa
mv globodera_pallida.*.protein.fa proteins/G_pallida.fa
mv meloidogyne_hapla.*.protein.fa proteins/M_hapla.fa

# Verify
for f in proteins/*.fa; do echo "$f: $(grep -c '>' $f) proteins"; done

# Run OrthoFinder
# -f = directory of protein FASTA files (one per species)
# -t = threads for BLAST
# -a = threads for OrthoFinder analysis
# -M msa = use multiple sequence alignment (more accurate)
orthofinder \
    -f proteins/ \
    -t 16 \
    -a 8 \
    -S diamond \
    -o orthofinder_results/

# Key output files:
ls orthofinder_results/Results_*/
# Orthogroups/Orthogroups.tsv                 -- main orthogroup table
# Orthogroups/Orthogroups_SingleCopyOrthologues.txt
# Phylogenetically_Misplaced_Genes/
# Orthogroup_Sequences/                        -- FASTA per orthogroup

Download:

OrthoFinder flags decoded

📁-f proteins/	Input folder containing one protein FASTA per species. OrthoFinder uses the filename as the species name in all outputs — choose meaningful filenames like `H_glycines.fa` rather than the original download names.
🧵-t 16 / -a 8	`-t` = threads for the all-vs-all protein comparison step (DIAMOND). `-a` = threads for the OrthoFinder analysis step (MCL clustering, gene tree inference). The DIAMOND step is the bottleneck, so give it more threads.
⚡-S diamond	Use DIAMOND instead of BLAST for the all-vs-all protein comparison. ~10× faster with comparable accuracy for orthogroup detection. Use `-S blast` if you need maximum sensitivity for highly diverged sequences.
📈Orthogroups.tsv	The main output: a tab-separated matrix of orthogroup ID vs each species, listing all gene members. Also see `Orthogroups_SingleCopyOrthologues.txt` for 1:1 orthologs — ideal for phylogenomics.

🌳

OrthoFinder's algorithm: It does all-vs-all DIAMOND, normalizes scores by sequence length and search space, applies RBH (reciprocal best-hit) per species pair, then uses MCL (Markov Clustering) to group sequences into orthogroups. It also infers a species tree and root gene trees to properly separate orthologs from paralogs.

Step 2 — Prepare Opscan Input Files

Opscan takes protein FASTA and GFF files to build gene-order lists needed for synteny detection:

Bash — Format FASTA headers for Opscan

mkdir -p opscan_input

# Opscan expects FASTA with headers in specific format:
# >species_geneName scaffold_name start stop strand
# Example: >Hgly_mRNA1 scaffold_1 100 500 +

# Use bioawk to reformat headers from GFF + FASTA
# This extracts protein ID, scaffold, coordinates, and strand from GFF
module load bioawk

GFF="H_glycines.gff3"
PROT="proteins/H_glycines.fa"

# Extract gene positions from GFF
grep $'\tmRNA\t' ${GFF} | awk '{
    split($9, a, ";")
    for (i in a) {
        if (a[i] ~ /^ID=/) {
            gsub("ID=", "", a[i])
            id = a[i]
        }
    }
    print id, $1, $4, $5, $7
}' OFS='\t' > Hgly_gene_positions.tsv

# Reformat protein FASTA with positional headers
python3 << 'EOF'
import pandas as pd
from Bio import SeqIO

positions = pd.read_csv("Hgly_gene_positions.tsv", sep="\t",
                         names=["id","scaffold","start","stop","strand"])
pos_dict = positions.set_index("id").to_dict("index")

with open("opscan_input/H_glycines.fasta", "w") as out:
    for rec in SeqIO.parse("proteins/H_glycines.fa", "fasta"):
        gid = rec.id.split(".")[0]  # strip isoform suffix
        if gid in pos_dict:
            p = pos_dict[gid]
            header = f"Hgly_{gid} {p['scaffold']} {p['start']} {p['stop']} {p['strand']}"
            out.write(f">{header}\n{str(rec.seq)}\n")
EOF
echo "Opscan input prepared: opscan_input/H_glycines.fasta"
head -4 opscan_input/H_glycines.fasta

Download:

Step 3 — Run Opscan for Pairwise Orthologs

Bash — Opscan pairwise ortholog detection

# Download Opscan (free, from iADHoRe supplementary)
wget http://bioinformatics.psb.ugent.be/downloads/psb/iADHoRe/opscan.tar.gz
tar xzf opscan.tar.gz
cd opscan/ && make && cd ..

OPSCAN="opscan/opscan"
INDIR="opscan_input"
OUTDIR="opscan_results"
mkdir -p ${OUTDIR}

# Run all pairwise combinations
for sp1 in H_glycines G_pallida M_hapla; do
    for sp2 in H_glycines G_pallida M_hapla; do
        if [ "${sp1}" != "${sp2}" ]; then
            PAIR="${sp1}_vs_${sp2}"
            echo "Running Opscan: ${PAIR}"

            ${OPSCAN} \
                ${INDIR}/${sp1}.fasta \
                ${INDIR}/${sp2}.fasta \
                > ${OUTDIR}/${PAIR}.orthologs.txt

            echo "  Ortholog pairs: $(wc -l < ${OUTDIR}/${PAIR}.orthologs.txt)"
        fi
    done
done

# Opscan output format (tab-separated):
# gene1_sp1  gene1_sp2  bitscore
# Hgly_mRNA1  Gpall_mRNA42  287
head -5 ${OUTDIR}/H_glycines_vs_G_pallida.orthologs.txt

Download:

Opscan and gene position formatting decoded

🏷️FASTA header format	Opscan requires headers in `>prefix_geneName scaffold start stop strand` format. The `prefix_` (e.g. `Hgly_`) identifies which species the gene belongs to in the output. Positional info allows Opscan to build the gene-order map for synteny detection.
🔄All pairwise combinations	Opscan runs pairwise (not multi-species at once). For N species you need N×(N-1) runs. For 3 species: 6 pairs. The nested loop handles this automatically. Each run produces a bidirectional ortholog list.
📊Output: gene1 gene2 bitscore	Each line is an ortholog pair with BLAST bitscore as confidence metric. Higher bitscore = better match. These pairs feed directly into i-ADHoRe to find collinear runs.

Step 4 — Filter Repetitive Genes Before Synteny

Bash — Remove repeat-associated genes from ortholog lists

# Run RepeatMasker to identify repetitive elements
module load repeatmasker

RepeatMasker \
    -pa 16 \
    -species nematoda \
    -gff \
    -dir repeatmasker_out/ \
    genome_assembly.fasta

# Get coordinates of repeat-overlapping genes from RepeatMasker GFF
REPEAT_GFF="repeatmasker_out/genome_assembly.fasta.out.gff"

# Use bedtools to identify genes overlapping repeats
module load bedtools

# Convert gene GFF to BED
grep $'\tmRNA\t' H_glycines.gff3 | \
    awk '{print $1, $4-1, $5, $9, ".", $7}' OFS='\t' \
    > genes.bed

# Get repeat coordinates as BED
awk '/^[^#]/ {print $1, $4-1, $5}' OFS='\t' ${REPEAT_GFF} > repeats.bed

# Find genes with >50% overlap with repeats
bedtools coverage \
    -a genes.bed \
    -b repeats.bed \
    | awk '$NF > 0.50 {print $4}' \
    | sed 's/ID=//' \
    > repeat_gene_ids.txt

echo "Genes overlapping repeats (>50%): $(wc -l < repeat_gene_ids.txt)"

# Remove these from Opscan outputs
for ortho in opscan_results/*.orthologs.txt; do
    OUT="${ortho%.orthologs.txt}.filtered.txt"
    grep -v -f repeat_gene_ids.txt ${ortho} > ${OUT}
    echo "Filtered: $(wc -l < ${ortho}) -> $(wc -l < ${OUT}) pairs"
done

Download:

Step 5 — i-ADHoRe Synteny Block Detection

Bash — i-ADHoRe configuration and run

mkdir -p iadhore_input iadhore_output

# For each genome, create per-scaffold gene list files
# i-ADHoRe needs: gene order files, one file per scaffold
# Format: gene_name orientation (+ or -)

for SPECIES in H_glycines G_pallida M_hapla; do
    SPECIES_DIR="iadhore_input/${SPECIES}"
    mkdir -p ${SPECIES_DIR}

    awk -v sp="${SPECIES}" '
    /^\t/ { next }  # skip header
    {
        print $2, $5 > (SPECIES_DIR "/" $1 ".lst")
        # Creates one file per scaffold with gene names and strands
    }' gene_positions_${SPECIES}.tsv
done

# Write i-ADHoRe configuration file
cat > iadhore.ini << 'EOF'
# i-ADHoRe 3.0 configuration file

# Genome 1: H. glycines
genome= H_glycines
H_glycines/scaffold_1.lst
H_glycines/scaffold_2.lst
# ... (list all scaffold files)

# Genome 2: G. pallida
genome= G_pallida
G_pallida/scaffold_1.lst
G_pallida/scaffold_2.lst

# Ortholog table (from Opscan output)
blast_table= opscan_results/H_glycines_vs_G_pallida.filtered.txt

# Output directory
output_path= iadhore_output/

# Parameters
gap_size= 30         # max gap between syntenic genes
cluster_gap= 35      # gap for merging clusters into blocks
q_value= 0.75        # FDR threshold
anchor_points= 3     # minimum collinear gene pairs
alignment_method= gg2  # greedy graph
max_gaps_in_alignment= 40
EOF

# Run i-ADHoRe
iadhore iadhore.ini

echo "i-ADHoRe complete. Synteny blocks:"
wc -l iadhore_output/anchorpoints.txt
echo "Output files:"
ls iadhore_output/

Download:

Step 6 — Synteny Dot Plots and Chord Diagrams

R — Synteny dot plot from i-ADHoRe anchorpoints

library(ggplot2)
library(dplyr)

# Load i-ADHoRe anchor points (syntenic gene pairs)
anchors <- read.table("iadhore_output/anchorpoints.txt",
                       header=TRUE, sep="\t")
head(anchors)

# Get gene positions for dot plot
positions_sp1 <- read.table("gene_positions_H_glycines.tsv",
                              col.names=c("id","scaffold","start","stop","strand"))
positions_sp2 <- read.table("gene_positions_G_pallida.tsv",
                              col.names=c("id","scaffold","start","stop","strand"))

# Join anchor points with positions
plot_data <- anchors %>%
    left_join(positions_sp1 %>% select(id, scaffold, start),
              by=c("gene_x"="id")) %>%
    rename(scaffold_x=scaffold, pos_x=start) %>%
    left_join(positions_sp2 %>% select(id, scaffold, start),
              by=c("gene_y"="id")) %>%
    rename(scaffold_y=scaffold, pos_y=start)

# Dot plot: each point is a syntenic gene pair
ggplot(plot_data, aes(x=pos_x/1e6, y=pos_y/1e6, color=scaffold_x)) +
    geom_point(size=0.5, alpha=0.7) +
    facet_grid(scaffold_y ~ scaffold_x, scales="free") +
    labs(x="H. glycines position (Mb)",
         y="G. pallida position (Mb)",
         title="Synteny: H. glycines vs G. pallida") +
    theme_minimal() +
    theme(legend.position="none",
          strip.text=element_text(size=6))

ggsave("synteny_dotplot.png", width=14, height=12, dpi=150)
cat("Synteny dot plot saved: synteny_dotplot.png\n")

Download:

Interpreting synteny dot plots: Diagonal lines indicate conserved collinear regions (same gene order). Off-diagonal dots indicate inversions or translocations. Dense diagonal blocks = large syntenic regions well-conserved between species. For closely related nematodes (same family), expect many syntenic blocks; for distantly related species, most synteny may be broken by repeat-driven rearrangements.

Troubleshooting

OrthoFinder: very large orthogroups containing hundreds of genes

This almost always indicates transposable element proteins (transposases, integrases, reverse transcriptases). Mask repeats before running OrthoFinder. Alternatively, filter orthogroups with unexpectedly high copy numbers (e.g., >3× the median gene copy per species).

i-ADHoRe: "No syntenic blocks detected"

Check: (1) Gene list files must be in iadhore_input/SPECIES/ and the INI file must list them correctly — even a typo in a path will silently produce no output. (2) Opscan ortholog file column order must match i-ADHoRe expectations (gene_sp1 TAB gene_sp2). (3) Try lowering anchor_points to 2 for distantly related species.

Opscan header format errors

Opscan is strict about the FASTA header format. Each protein header must have exactly 5 space-separated fields: species_geneName scaffold start stop strand. Scaffold names cannot contain spaces or special characters. Test with 10 proteins first: head -20 input.fasta | ./opscan /dev/stdin input.fasta.

Gene Orthology & Synteny Analysis

On this page

Overview

OrthoFinder / Opscan

i-ADHoRe

Ideal Directory Structure

Orthologs vs Paralogs vs Xenologs

Dataset

Step 1 — OrthoFinder: All-vs-All Ortholog Detection

Step 2 — Prepare Opscan Input Files

Step 3 — Run Opscan for Pairwise Orthologs

Step 4 — Filter Repetitive Genes Before Synteny

Step 5 — i-ADHoRe Synteny Block Detection

Step 6 — Synteny Dot Plots and Chord Diagrams

Troubleshooting

OrthoFinder: very large orthogroups containing hundreds of genes

i-ADHoRe: "No syntenic blocks detected"

Opscan header format errors