Infer orthologs and hierarchical orthologous groups (HOGs) across thousands of species — orders of magnitude faster than traditional OMA, with accuracy benchmarked by the Quest for Orthologs consortium.
Orthology is the relationship between genes in different species that descended from a common ancestral gene by speciation (not gene duplication). Identifying orthologs is foundational for:

- transferring functional annotations between genomes
- selecting single-copy marker genes for species tree inference
- comparative genomics of gene family gain, loss, and expansion
FastOMA (Majidian et al., Nature Methods 2024) is a scalable reimplementation of the OMA (Orthologous MAtrix) algorithm. It uses a hierarchical approach — leveraging the OMA knowledge base and species tree topology — to infer orthologous groups orders of magnitude faster than the original, enabling genome-scale analyses across thousands of species.
| Term | Definition | Origin event |
|---|---|---|
| Orthologs | Same gene, different species, separated by speciation | Speciation |
| Paralogs | Gene copies within or across species from duplication | Gene duplication |
| Homeologs | Genes from polyploidization events (subgenomes) | Whole genome duplication |
HOGs are a richer representation than pairwise orthologs. A HOG at a given taxonomic level (e.g., Vertebrata) groups all genes that descended from a single ancestral gene at that level. This captures the full duplication/loss history of a gene family.
FastOMA outputs HOGs in OrthoXML format, the standard exchange format used by orthology resources such as OMA and PANTHER. The output can be parsed with the pyHam Python library for downstream HOG analysis.
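To make the format concrete, here is a minimal hand-written OrthoXML sketch (illustrative only, not actual FastOMA output, and abbreviated — `geneRef` ids 2 and 3 would be declared under their own `species` blocks). The key idea is that a `paralogGroup` nested inside an `orthologGroup` encodes a duplication event:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="example" originVersion="1">
  <species name="ARATH" NCBITaxId="3702">
    <database name="example_db" version="1">
      <genes>
        <gene id="1" protId="ARATH|AT1G01010.1"/>
      </genes>
    </database>
  </species>
  <!-- one <species> block per genome, then the groups -->
  <groups>
    <orthologGroup id="HOG:0000001">
      <geneRef id="1"/>
      <!-- a nested paralogGroup marks a duplication event -->
      <paralogGroup>
        <geneRef id="2"/>
        <geneRef id="3"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
```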
FastOMA is installable via pip or conda. A Nextflow pipeline is also available for large-scale runs.
```bash
# Create a dedicated environment
conda create -n fastoma python=3.10 -y
conda activate fastoma

# Install FastOMA and dependencies
pip install FastOMA

# Verify installation
fastoma --version
```
```bash
conda install -c bioconda fastoma -y
```
```bash
docker pull dessimozlab/fastoma:latest
docker run -v $(pwd):/data dessimozlab/fastoma:latest fastoma --help
```
```bash
# FastOMA requires these tools in PATH:
conda install -c bioconda diamond mmseqs2 mafft fasttree -y
```
FastOMA needs two inputs: a folder of proteome FASTA files (one per species) and a species tree in Newick format.
```bash
# Create project directory
mkdir fastoma_project && cd fastoma_project
mkdir proteomes

# Download example proteomes from UniProt (reviewed entries only);
# shown for Arabidopsis and rice — repeat for maize, sorghum, and tomato
# Arabidopsis thaliana
wget -O proteomes/ARATH.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:3702+AND+reviewed:true"
# Oryza sativa subsp. japonica (rice)
wget -O proteomes/ORYSJ.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:39947+AND+reviewed:true"

# Or use any genome annotation protein output (e.g., MAKER .proteins.fasta)
# Naming convention: species_code.fa (important for FastOMA parsing)
```
Each proteome file must be named by its species code (e.g., HUMAN.fa, MOUSE.fa). Sequence headers must start with >species_code|proteinID (e.g., >HUMAN|P53_HUMAN).
```bash
# If your headers are plain gene IDs, reformat them:
# Input:  >AT1G01010.1 protein
# Output: >ARATH|AT1G01010.1
python3 - <<'EOF'
species_code = "ARATH"
with open("proteomes/arabidopsis_raw.fa") as f, open("proteomes/ARATH.fa", "w") as out:
    for line in f:
        if line.startswith(">"):
            gene_id = line.strip().split()[0].lstrip(">")
            out.write(f">{species_code}|{gene_id}\n")
        else:
            out.write(line)
print("Done reformatting headers")
EOF
```
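Before running anything downstream, it is worth verifying that every header actually matches the `>species_code|proteinID` pattern. A small sketch (the regex and the toy headers below are illustrative; adjust the pattern if your species codes differ):

```python
import re

# Headers FastOMA expects look like ">SPECIESCODE|proteinID"
HEADER_RE = re.compile(r"^>[A-Z0-9]{3,10}\|\S+$")

def check_headers(lines):
    """Return the header lines that do NOT follow the expected pattern."""
    return [ln for ln in lines
            if ln.startswith(">") and not HEADER_RE.match(ln.strip())]

# Toy example — in practice, iterate over every file in proteomes/
good = [">ARATH|AT1G01010.1", "MSTNKQ", ">ARATH|AT1G01020.1"]
bad = [">AT1G01010.1 protein"]
print(check_headers(good))  # []
print(check_headers(bad))
```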
A species tree in Newick format constrains HOG inference to biologically meaningful evolutionary relationships. You can use a published tree from TimeTree or NCBI taxonomy.
```bash
# Example species tree for 5 plant species (Newick format)
# Save as species_tree.nwk
cat > species_tree.nwk <<'EOF'
((ARATH,SOLLC),(ORYSJ,(SORBI,MAIZE)));
EOF

# Verify tree format
python3 -c "
from ete3 import Tree
t = Tree('species_tree.nwk')
print('Species in tree:', [l.name for l in t.get_leaves()])
print('Tree looks valid!')
"
```
FastOMA runs as a single command: it performs all-vs-all protein comparisons, computes pairwise orthologs, then hierarchically infers HOGs along the species tree.
```bash
# Basic run — FastOMA auto-detects all .fa files in the proteomes/ folder
fastoma \
    --proteomes proteomes/ \
    --species-tree species_tree.nwk \
    --output-folder fastoma_output/ \
    --threads 8

# With additional options for large datasets:
#   --aligner diamond  (faster than the default for large proteomes)
#   --min-length 50    (filter out short proteins)
fastoma \
    --proteomes proteomes/ \
    --species-tree species_tree.nwk \
    --output-folder fastoma_output/ \
    --threads 32 \
    --low-mem \
    --aligner diamond \
    --min-length 50 \
    --verbose
```
| Dataset size | Threads | Approx. time |
|---|---|---|
| 5–10 species | 8 | 5–15 minutes |
| 50–100 species | 16 | 30–90 minutes |
| 500 species | 32 | 3–6 hours |
| 1,000+ species | 64 | 12–24 hours |
```
fastoma_output/
├── OrthologousGroups.tsv        # Pairwise orthologs (tab-separated)
├── HierarchicalGroups.orthoxml  # HOGs in standard OrthoXML format
├── OrthologousMatrix.tsv        # Presence/absence matrix per HOG per species
├── speciestree_checked.nwk      # Validated/reconciled species tree
├── RootHOG_summary.tsv          # Summary stats per HOG at root level
└── logs/                        # Detailed run logs
```
```python
import pandas as pd

# Load pairwise orthologs
orthos = pd.read_csv("fastoma_output/OrthologousGroups.tsv", sep="\t",
                     names=["protein1", "protein2", "type", "score"])
print(f"Total orthologous pairs: {len(orthos):,}")
print(orthos["type"].value_counts())

# type column values:
#   1:1 — strict one-to-one orthologs (best for phylogenomics)
#   1:m — one-to-many (gene duplication in one lineage)
#   m:1 — many-to-one
#   m:m — many-to-many (complex duplication history)

# Filter to 1:1 orthologs only
one_to_one = orthos[orthos["type"] == "1:1"]
print(f"\nOne-to-one orthologs: {len(one_to_one):,}")

# Example output:
#   protein1          protein2         type  score
#   ARATH|AT1G01010   ORYSJ|Os01g0100  1:1   187.3
```
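Since the species code is embedded in each protein ID before the pipe, it takes only a few lines to count how many 1:1 orthologs each species pair shares — a quick sanity check on the run. A sketch on a toy table in the same shape as `OrthologousGroups.tsv` (the IDs are made up):

```python
import pandas as pd

# Toy rows mimicking OrthologousGroups.tsv
orthos = pd.DataFrame({
    "protein1": ["ARATH|g1", "ARATH|g2", "ORYSJ|g9"],
    "protein2": ["ORYSJ|o1", "SORBI|s2", "SORBI|s3"],
    "type":     ["1:1",      "1:1",      "1:m"],
})

# Species code is the part of the ID before the pipe
orthos["sp1"] = orthos["protein1"].str.split("|").str[0]
orthos["sp2"] = orthos["protein2"].str.split("|").str[0]

# How many 1:1 orthologs does each species pair share?
pair_counts = (orthos[orthos["type"] == "1:1"]
               .groupby(["sp1", "sp2"]).size())
print(pair_counts)
```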
pyHam is the official Python library for parsing and visualising HOG evolution from FastOMA/OMA output.
```bash
pip install pyham
```
```python
import pyham

# Load HOGs
ham = pyham.Ham(
    tree_file="fastoma_output/speciestree_checked.nwk",
    hog_file="fastoma_output/HierarchicalGroups.orthoxml",
    use_internal_name=True
)

# Get all HOGs at the root level
all_hogs = ham.get_list_top_level_hogs()
print(f"Total root HOGs: {len(all_hogs)}")

# Inspect a single HOG
hog = all_hogs[0]
print(f"HOG ID: {hog.hog_id}")
print(f"Genes: {[g.unique_id for g in hog.get_all_descendant_genes()]}")

# Count single-copy conserved HOGs (useful for species tree inference):
# exactly one gene per species, present in every species
n_species = len(ham.get_list_extant_genomes())

def is_single_copy(h):
    genes = h.get_all_descendant_genes()
    return (len(genes) == n_species
            and len({g.genome.name for g in genes}) == n_species)

single_copy = [h for h in all_hogs if is_single_copy(h)]
print(f"Single-copy conserved HOGs: {len(single_copy)}")

# Export an interactive iHam visualisation (HTML) of a HOG's
# duplication/loss history for inspection in a browser
html = ham.create_iHam(hog=all_hogs[0])
with open("hog0_iham.html", "w") as f:
    f.write(html)
```
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load orthologous matrix (HOGs × species)
mat = pd.read_csv("fastoma_output/OrthologousMatrix.tsv", sep="\t", index_col=0)

# Convert gene IDs to a 1/0 presence–absence matrix
pa_mat = (mat.notna() & (mat != "")).astype(int)

# Plot heatmap of gene family presence/absence
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pa_mat.head(100),
            cmap="Blues", linewidths=0.3,
            cbar_kws={"label": "Gene present"},
            ax=ax)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("HOG (top 100)", fontsize=12)
ax.set_title("Gene family presence/absence across species", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("gene_content_heatmap.pdf", dpi=300, bbox_inches="tight")
plt.show()
```
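The same presence–absence matrix also supports a pangenome-style summary: core HOGs (in every species), cloud HOGs (species-specific), and the shell in between. A sketch on a toy matrix (the categories and thresholds are a common convention, not FastOMA terminology):

```python
import pandas as pd

# Toy presence/absence matrix: rows = HOGs, columns = species
pa = pd.DataFrame(
    {"ARATH": [1, 1, 0, 1],
     "ORYSJ": [1, 1, 1, 0],
     "SORBI": [1, 0, 1, 0]},
    index=["HOG1", "HOG2", "HOG3", "HOG4"])

n_species = pa.shape[1]
counts = pa.sum(axis=1)          # species per HOG

core = (counts == n_species).sum()   # present in every species
cloud = (counts == 1).sum()          # species-specific
shell = len(pa) - core - cloud       # everything in between
print(f"core={core}, shell={shell}, cloud={cloud}")
```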
Single-copy orthologs inferred by FastOMA are ideal markers for species tree inference using concatenation or coalescent methods.
```bash
# Extract single-copy HOG sequences for phylogenomics
python3 - <<'EOF'
import glob, os
import pyham

ham = pyham.Ham(tree_file="fastoma_output/speciestree_checked.nwk",
                hog_file="fastoma_output/HierarchicalGroups.orthoxml",
                use_internal_name=True)

# OrthoXML stores protein IDs, not sequences — map IDs back to the
# input FASTAs (indexed under both the full header token and the bare
# gene ID, since the exact protId format can vary)
seqs = {}
for fa in glob.glob("proteomes/*.fa"):
    cur = None
    for line in open(fa):
        if line.startswith(">"):
            tok = line[1:].split()[0]
            cur = []
            seqs[tok] = cur
            seqs[tok.split("|")[-1]] = cur
        else:
            cur.append(line.strip())
seqs = {h: "".join(s) for h, s in seqs.items()}

n_species = len(ham.get_list_extant_genomes())
os.makedirs("single_copy_fastas", exist_ok=True)
count = 0
for hog in ham.get_list_top_level_hogs():
    genes = hog.get_all_descendant_genes()
    # single copy: exactly one gene per species, in every species
    if len(genes) != n_species or len({g.genome.name for g in genes}) != n_species:
        continue
    with open(f"single_copy_fastas/HOG{hog.hog_id}.fa", "w") as f:
        for g in genes:
            f.write(f">{g.genome.name}\n{seqs[g.prot_id]}\n")
    count += 1
    if count >= 200:  # use up to 200 loci for concatenation
        break
print(f"Extracted {count} single-copy HOGs")
EOF
```
```bash
# Align each HOG with MAFFT
for f in single_copy_fastas/*.fa; do
    mafft --auto --quiet "$f" > "${f%.fa}_aln.fa"
done
```
```bash
# Concatenate alignments for IQ-TREE
python3 - <<'EOF'
import glob
from Bio import SeqIO

alignments = sorted(glob.glob("single_copy_fastas/*_aln.fa"))

# Collect the full species set across all loci
all_species = set()
for aln_file in alignments:
    for rec in SeqIO.parse(aln_file, "fasta"):
        all_species.add(rec.id)

# Concatenate, padding species missing from a locus with gaps
concat = {sp: "" for sp in all_species}
for aln_file in alignments:
    aln = {r.id: str(r.seq) for r in SeqIO.parse(aln_file, "fasta")}
    aln_len = len(next(iter(aln.values())))
    for sp in all_species:
        concat[sp] += aln.get(sp, "-" * aln_len)

with open("concatenated_alignment.fa", "w") as f:
    for sp, seq in concat.items():
        f.write(f">{sp}\n{seq}\n")
print(f"Concatenated {len(alignments)} loci, {len(all_species)} species")
EOF
```
```bash
# Run IQ-TREE on the concatenated alignment
iqtree2 -s concatenated_alignment.fa \
    -m TEST \
    -B 1000 \
    -T AUTO \
    --prefix species_tree_fastoma \
    --quiet
```
```python
from ete3 import Tree, TreeStyle, NodeStyle

t = Tree("species_tree_fastoma.treefile")

# Style the tree
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = True
ts.branch_vertical_margin = 15
ts.scale = 100

for node in t.traverse():
    ns = NodeStyle()
    if node.is_leaf():
        ns["size"] = 8
        ns["fgcolor"] = "#2196F3"
    else:
        ns["size"] = 5
        ns["fgcolor"] = "#FF5722"
    node.set_style(ns)

t.render("species_tree.pdf", tree_style=ts, w=200, units="mm")
print("Tree saved to species_tree.pdf")
```
```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative event counts — replace with values parsed from your own run
species = ["ARATH", "ORYSJ", "SORBI", "MAIZE", "SOLLC"]
duplications = [1243, 1876, 2341, 3102, 987]
losses = [456, 678, 890, 1203, 312]

x = np.arange(len(species))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x - width/2, duplications, width, label="Duplications", color="#4CAF50", alpha=0.85)
bars2 = ax.bar(x + width/2, losses, width, label="Losses", color="#F44336", alpha=0.85)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("Gene family events", fontsize=12)
ax.set_title("HOG duplication and loss events per lineage", fontsize=14, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(species, rotation=20)
ax.legend()
ax.bar_label(bars1, padding=3, fontsize=9)
ax.bar_label(bars2, padding=3, fontsize=9)
plt.tight_layout()
plt.savefig("hog_events.pdf", dpi=300, bbox_inches="tight")
plt.show()
```
| Feature | FastOMA | OrthoFinder | OMA Standalone |
|---|---|---|---|
| Publication | Nature Methods 2024 | Genome Biology 2019 | Genome Research 2019 |
| Speed (1000 spp) | Hours | Days | Weeks |
| HOG inference | Yes (hierarchical) | Yes (rooted) | Yes (hierarchical) |
| OrthoXML output | Yes | No | Yes |
| Species tree input | Optional | Required for HOGs | Required |
| Quest for Orthologs accuracy | High (comparable to OMA) | High | Highest |
| Best for | Large-scale (100–10,000 spp) | 10–200 spp | <100 spp, maximum accuracy |