Gene Orthology & Synteny Analysis

Identify orthologous gene families across nematode genomes with OrthoFinder and Opscan, then detect conserved synteny blocks with i-ADHoRe — mapping the evolutionary history of genome architecture.

~120 min Advanced Bash
Byte

Overview

Comparative genomics reveals the evolutionary forces shaping genomes. Synteny analysis identifies chromosomal regions that retain the same gene order across species — evidence for shared ancestry. This tutorial covers two complementary approaches:

OrthoFinder / Opscan

All-vs-all protein BLAST to identify orthologous gene families across multiple genomes. Output: orthogroups, orthologs per species pair, gene duplication history.

i-ADHoRe

Detects co-linear (syntenic) blocks from ortholog gene lists. Identifies whole-genome duplications (WGD), segmental duplications, and conserved chromosomal regions across species.

Ideal Directory Structure

Comparative genomics projects involve large intermediate files from BLAST all-vs-all. Keep each tool's outputs in numbered directories so you can re-run any step independently.

comparative_genomics_nematodes/
comparative_genomics_nematodes/ ├── proteins/ # one protein FASTA per species (OrthoFinder input) │ ├── H_glycines.fa │ ├── G_pallida.fa │ └── M_hapla.fa ├── 1_orthofinder/ # OrthoFinder all-vs-all BLAST + orthogroups │ └── Orthogroups.tsv # main output: orthogroups × species matrix ├── 2_opscan/ # Opscan pairwise ortholog detection │ ├── input/ # positionally-formatted FASTA files │ └── Hgly_vs_Gpall.orthologs.txt ├── 3_repeatmasker/ # repeat masking for repeat gene filtering ├── 4_iadhore/ # i-ADHoRe synteny detection │ ├── iadhore.ini # configuration file │ ├── input/ # per-scaffold gene order lists │ └── anchorpoints.txt # collinear gene pairs (key output) └── 5_visualization/ # synteny plots (dotplots, chord diagrams) └── synteny_dotplot.png
⏱️
Runtime reality check: OrthoFinder DIAMOND all-vs-all for 3 nematode genomes (~20,000 proteins each) takes ~2–6 hours on 16 threads. The BLAST phase is the bottleneck. Use -S diamond (10× faster than BLAST) or -S diamond_ultra_sens for maximum sensitivity. Store Results_*/WorkingDirectory/ in case you need to resume a failed run.

Orthologs vs Paralogs vs Xenologs

TermDefinitionEvolutionary event
OrthologsSame gene in different species, related by speciationSpeciation (vertical inheritance)
ParalogsDuplicated copies of a gene within the same or different speciesGene duplication
XenologsGenes related by horizontal gene transferHGT
OrthogroupGroup of genes descended from a single gene in the last common ancestorIncludes orthologs and paralogs
Byte
Why filter repetitive elements? Transposable elements and other repeats encode proteins (transposases, reverse transcriptases) that BLAST readily against each other. If included, they create enormous spurious orthogroups swamping the real signal. Always soft-mask or hard-mask repeats before running OrthoFinder or Opscan.

Dataset

Three plant-parasitic nematode genomes from WormBase ParaSite — all well-annotated, publicly available:

SpeciesCommon nameGenome sizeSource
Heterodera glycinesSoybean cyst nematode93 MbWormBase ParaSite
Globodera pallidaPotato cyst nematode124 MbWormBase ParaSite
Meloidogyne haplaRoot-knot nematode54 MbWormBase ParaSite

Step 1 — OrthoFinder: All-vs-All Ortholog Detection

Bash — OrthoFinder setup and run
conda install -c bioconda orthofinder diamond
# or: module load orthofinder/2.5.2

mkdir -p proteins

# Download protein FASTA from WormBase ParaSite for each species
wget "https://parasite.wormbase.org/ftp.html" -q  # see FTP directory for paths
# Example paths (actual URL structure):
# ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/current/species/
#   heterodera_glycines/PRJNA381081/heterodera_glycines.PRJNA381081.WBPS18.protein.fa.gz
#   globodera_pallida/PRJEB123/globodera_pallida.PRJEB123.WBPS18.protein.fa.gz
#   meloidogyne_hapla/PRJNA29083/meloidogyne_hapla.PRJNA29083.WBPS18.protein.fa.gz

# Decompress and rename FASTA files
# OrthoFinder uses filename as species name
gzip -d *.gz
mv heterodera_glycines.*.protein.fa proteins/H_glycines.fa
mv globodera_pallida.*.protein.fa proteins/G_pallida.fa
mv meloidogyne_hapla.*.protein.fa proteins/M_hapla.fa

# Verify
for f in proteins/*.fa; do echo "$f: $(grep -c '>' $f) proteins"; done

# Run OrthoFinder
# -f = directory of protein FASTA files (one per species)
# -t = threads for BLAST
# -a = threads for OrthoFinder analysis
# -M msa = use multiple sequence alignment (more accurate)
orthofinder \
    -f proteins/ \
    -t 16 \
    -a 8 \
    -S diamond \
    -o orthofinder_results/

# Key output files:
ls orthofinder_results/Results_*/
# Orthogroups/Orthogroups.tsv                 -- main orthogroup table
# Orthogroups/Orthogroups_SingleCopyOrthologues.txt
# Phylogenetically_Misplaced_Genes/
# Orthogroup_Sequences/                        -- FASTA per orthogroup
Download:
OrthoFinder flags decoded
📁-f proteins/Input folder containing one protein FASTA per species. OrthoFinder uses the filename as the species name in all outputs — choose meaningful filenames like H_glycines.fa rather than the original download names.
🧵-t 16 / -a 8-t = threads for the all-vs-all protein comparison step (DIAMOND). -a = threads for the OrthoFinder analysis step (MCL clustering, gene tree inference). The DIAMOND step is the bottleneck, so give it more threads.
-S diamondUse DIAMOND instead of BLAST for the all-vs-all protein comparison. ~10× faster with comparable accuracy for orthogroup detection. Use -S blast if you need maximum sensitivity for highly diverged sequences.
📈Orthogroups.tsvThe main output: a tab-separated matrix of orthogroup ID vs each species, listing all gene members. Also see Orthogroups_SingleCopyOrthologues.txt for 1:1 orthologs — ideal for phylogenomics.
🌳
OrthoFinder's algorithm: It does all-vs-all DIAMOND, normalizes scores by sequence length and search space, applies RBH (reciprocal best-hit) per species pair, then uses MCL (Markov Clustering) to group sequences into orthogroups. It also infers a species tree and root gene trees to properly separate orthologs from paralogs.

Step 2 — Prepare Opscan Input Files

Opscan takes protein FASTA and GFF files to build gene-order lists needed for synteny detection:

Bash — Format FASTA headers for Opscan
mkdir -p opscan_input

# Opscan expects FASTA with headers in specific format:
# >species_geneName scaffold_name start stop strand
# Example: >Hgly_mRNA1 scaffold_1 100 500 +

# Use bioawk to reformat headers from GFF + FASTA
# This extracts protein ID, scaffold, coordinates, and strand from GFF
module load bioawk

GFF="H_glycines.gff3"
PROT="proteins/H_glycines.fa"

# Extract gene positions from GFF
grep $'\tmRNA\t' ${GFF} | awk '{
    split($9, a, ";")
    for (i in a) {
        if (a[i] ~ /^ID=/) {
            gsub("ID=", "", a[i])
            id = a[i]
        }
    }
    print id, $1, $4, $5, $7
}' OFS='\t' > Hgly_gene_positions.tsv

# Reformat protein FASTA with positional headers
python3 << 'EOF'
import pandas as pd
from Bio import SeqIO

positions = pd.read_csv("Hgly_gene_positions.tsv", sep="\t",
                         names=["id","scaffold","start","stop","strand"])
pos_dict = positions.set_index("id").to_dict("index")

with open("opscan_input/H_glycines.fasta", "w") as out:
    for rec in SeqIO.parse("proteins/H_glycines.fa", "fasta"):
        gid = rec.id.split(".")[0]  # strip isoform suffix
        if gid in pos_dict:
            p = pos_dict[gid]
            header = f"Hgly_{gid} {p['scaffold']} {p['start']} {p['stop']} {p['strand']}"
            out.write(f">{header}\n{str(rec.seq)}\n")
EOF
echo "Opscan input prepared: opscan_input/H_glycines.fasta"
head -4 opscan_input/H_glycines.fasta
Download:

Step 3 — Run Opscan for Pairwise Orthologs

Bash — Opscan pairwise ortholog detection
# Download Opscan (free, from iADHoRe supplementary)
wget http://bioinformatics.psb.ugent.be/downloads/psb/iADHoRe/opscan.tar.gz
tar xzf opscan.tar.gz
cd opscan/ && make && cd ..

OPSCAN="opscan/opscan"
INDIR="opscan_input"
OUTDIR="opscan_results"
mkdir -p ${OUTDIR}

# Run all pairwise combinations
for sp1 in H_glycines G_pallida M_hapla; do
    for sp2 in H_glycines G_pallida M_hapla; do
        if [ "${sp1}" != "${sp2}" ]; then
            PAIR="${sp1}_vs_${sp2}"
            echo "Running Opscan: ${PAIR}"

            ${OPSCAN} \
                ${INDIR}/${sp1}.fasta \
                ${INDIR}/${sp2}.fasta \
                > ${OUTDIR}/${PAIR}.orthologs.txt

            echo "  Ortholog pairs: $(wc -l < ${OUTDIR}/${PAIR}.orthologs.txt)"
        fi
    done
done

# Opscan output format (tab-separated):
# gene1_sp1  gene1_sp2  bitscore
# Hgly_mRNA1  Gpall_mRNA42  287
head -5 ${OUTDIR}/H_glycines_vs_G_pallida.orthologs.txt
Download:
Opscan and gene position formatting decoded
🏷️FASTA header formatOpscan requires headers in >prefix_geneName scaffold start stop strand format. The prefix_ (e.g. Hgly_) identifies which species the gene belongs to in the output. Positional info allows Opscan to build the gene-order map for synteny detection.
🔄All pairwise combinationsOpscan runs pairwise (not multi-species at once). For N species you need N×(N-1) runs. For 3 species: 6 pairs. The nested loop handles this automatically. Each run produces a bidirectional ortholog list.
📊Output: gene1 gene2 bitscoreEach line is an ortholog pair with BLAST bitscore as confidence metric. Higher bitscore = better match. These pairs feed directly into i-ADHoRe to find collinear runs.

Step 4 — Filter Repetitive Genes Before Synteny

Bash — Remove repeat-associated genes from ortholog lists
# Run RepeatMasker to identify repetitive elements
module load repeatmasker

RepeatMasker \
    -pa 16 \
    -species nematoda \
    -gff \
    -dir repeatmasker_out/ \
    genome_assembly.fasta

# Get coordinates of repeat-overlapping genes from RepeatMasker GFF
REPEAT_GFF="repeatmasker_out/genome_assembly.fasta.out.gff"

# Use bedtools to identify genes overlapping repeats
module load bedtools

# Convert gene GFF to BED
grep $'\tmRNA\t' H_glycines.gff3 | \
    awk '{print $1, $4-1, $5, $9, ".", $7}' OFS='\t' \
    > genes.bed

# Get repeat coordinates as BED
awk '/^[^#]/ {print $1, $4-1, $5}' OFS='\t' ${REPEAT_GFF} > repeats.bed

# Find genes with >50% overlap with repeats
bedtools coverage \
    -a genes.bed \
    -b repeats.bed \
    | awk '$NF > 0.50 {print $4}' \
    | sed 's/ID=//' \
    > repeat_gene_ids.txt

echo "Genes overlapping repeats (>50%): $(wc -l < repeat_gene_ids.txt)"

# Remove these from Opscan outputs
for ortho in opscan_results/*.orthologs.txt; do
    OUT="${ortho%.orthologs.txt}.filtered.txt"
    grep -v -f repeat_gene_ids.txt ${ortho} > ${OUT}
    echo "Filtered: $(wc -l < ${ortho}) -> $(wc -l < ${OUT}) pairs"
done
Download:

Step 5 — i-ADHoRe Synteny Block Detection

Bash — i-ADHoRe configuration and run
mkdir -p iadhore_input iadhore_output

# For each genome, create per-scaffold gene list files
# i-ADHoRe needs: gene order files, one file per scaffold
# Format: gene_name orientation (+ or -)

for SPECIES in H_glycines G_pallida M_hapla; do
    SPECIES_DIR="iadhore_input/${SPECIES}"
    mkdir -p ${SPECIES_DIR}

    awk -v sp="${SPECIES}" '
    /^\t/ { next }  # skip header
    {
        print $2, $5 > (SPECIES_DIR "/" $1 ".lst")
        # Creates one file per scaffold with gene names and strands
    }' gene_positions_${SPECIES}.tsv
done

# Write i-ADHoRe configuration file
cat > iadhore.ini << 'EOF'
# i-ADHoRe 3.0 configuration file

# Genome 1: H. glycines
genome= H_glycines
H_glycines/scaffold_1.lst
H_glycines/scaffold_2.lst
# ... (list all scaffold files)

# Genome 2: G. pallida
genome= G_pallida
G_pallida/scaffold_1.lst
G_pallida/scaffold_2.lst

# Ortholog table (from Opscan output)
blast_table= opscan_results/H_glycines_vs_G_pallida.filtered.txt

# Output directory
output_path= iadhore_output/

# Parameters
gap_size= 30         # max gap between syntenic genes
cluster_gap= 35      # gap for merging clusters into blocks
q_value= 0.75        # FDR threshold
anchor_points= 3     # minimum collinear gene pairs
alignment_method= gg2  # greedy graph
max_gaps_in_alignment= 40
EOF

# Run i-ADHoRe
iadhore iadhore.ini

echo "i-ADHoRe complete. Synteny blocks:"
wc -l iadhore_output/anchorpoints.txt
echo "Output files:"
ls iadhore_output/
Download:

Step 6 — Synteny Dot Plots and Chord Diagrams

R — Synteny dot plot from i-ADHoRe anchorpoints
library(ggplot2)
library(dplyr)

# Load i-ADHoRe anchor points (syntenic gene pairs)
anchors <- read.table("iadhore_output/anchorpoints.txt",
                       header=TRUE, sep="\t")
head(anchors)

# Get gene positions for dot plot
positions_sp1 <- read.table("gene_positions_H_glycines.tsv",
                              col.names=c("id","scaffold","start","stop","strand"))
positions_sp2 <- read.table("gene_positions_G_pallida.tsv",
                              col.names=c("id","scaffold","start","stop","strand"))

# Join anchor points with positions
plot_data <- anchors %>%
    left_join(positions_sp1 %>% select(id, scaffold, start),
              by=c("gene_x"="id")) %>%
    rename(scaffold_x=scaffold, pos_x=start) %>%
    left_join(positions_sp2 %>% select(id, scaffold, start),
              by=c("gene_y"="id")) %>%
    rename(scaffold_y=scaffold, pos_y=start)

# Dot plot: each point is a syntenic gene pair
ggplot(plot_data, aes(x=pos_x/1e6, y=pos_y/1e6, color=scaffold_x)) +
    geom_point(size=0.5, alpha=0.7) +
    facet_grid(scaffold_y ~ scaffold_x, scales="free") +
    labs(x="H. glycines position (Mb)",
         y="G. pallida position (Mb)",
         title="Synteny: H. glycines vs G. pallida") +
    theme_minimal() +
    theme(legend.position="none",
          strip.text=element_text(size=6))

ggsave("synteny_dotplot.png", width=14, height=12, dpi=150)
cat("Synteny dot plot saved: synteny_dotplot.png\n")
Download:
Interpreting synteny dot plots: Diagonal lines indicate conserved collinear regions (same gene order). Off-diagonal dots indicate inversions or translocations. Dense diagonal blocks = large syntenic regions well-conserved between species. For closely related nematodes (same family), expect many syntenic blocks; for distantly related species, most synteny may be broken by repeat-driven rearrangements.

Troubleshooting

OrthoFinder: very large orthogroups containing hundreds of genes

This almost always indicates transposable element proteins (transposases, integrases, reverse transcriptases). Mask repeats before running OrthoFinder. Alternatively, filter orthogroups with unexpectedly high copy numbers (e.g., >3× the median gene copy per species).

i-ADHoRe: "No syntenic blocks detected"

Check: (1) Gene list files must be in iadhore_input/SPECIES/ and the INI file must list them correctly — even a typo in a path will silently produce no output. (2) Opscan ortholog file column order must match i-ADHoRe expectations (gene_sp1 TAB gene_sp2). (3) Try lowering anchor_points to 2 for distantly related species.

Opscan header format errors

Opscan is strict about the FASTA header format. Each protein header must have exactly 5 space-separated fields: species_geneName scaffold start stop strand. Scaffold names cannot contain spaces or special characters. Test with 10 proteins first: head -20 input.fasta | ./opscan /dev/stdin input.fasta.