Nature Methods 2024

FastOMA: Orthology Inference at Scale

Infer orthologs and hierarchical orthologous groups (HOGs) across thousands of species — orders of magnitude faster than traditional OMA, with benchmarked accuracy from the Quest for Orthologs consortium.

~40 min Intermediate Python / CLI Comparative Genomics
Byte

What is FastOMA?

Orthology is the relationship between genes in different species that descended from a common ancestor by speciation (not gene duplication). Identifying orthologs is foundational for:

  • Functional annotation of newly sequenced genomes (transferring known gene function)
  • Building species trees from single-copy orthologs
  • Studying gene family evolution, duplication, and loss
  • Cross-species comparative transcriptomics

FastOMA (Majidian et al., Nature Methods 2024) is a scalable reimplementation of the OMA (Orthologous MAtrix) algorithm. It uses a hierarchical approach — leveraging the OMA knowledge base and species tree topology — to infer orthologous groups orders of magnitude faster than the original, enabling genome-scale analyses across thousands of species.

Byte key point
Key result from the paper: FastOMA achieves accuracy comparable to OMA Standalone on Quest for Orthologs benchmarks while being ~700x faster on large datasets. It can handle 1,000+ species in hours on a standard server.

Key Concepts

Ortholog vs Paralog vs Homeolog
TermDefinitionOrigin event
OrthologsSame gene, different species, separated by speciationSpeciation
ParalogsGene copies within or across species from duplicationGene duplication
HomeologsGenes from polyploidization events (subgenomes)Whole genome duplication
Hierarchical Orthologous Groups (HOGs)

HOGs are a richer representation than pairwise orthologs. A HOG at a given taxonomic level (e.g., Vertebrata) groups all genes that descended from a single ancestral gene at that level. This captures the full duplication/loss history of a gene family.

OrthoXML Output

FastOMA outputs HOGs in OrthoXML format, which is the standard for orthology databases (OMA, OrthoFinder, PANTHER). It's parseable with Python's pyHam library for downstream HOG analysis.

Installation

FastOMA is installable via pip or conda. A Snakemake workflow is also available for large-scale runs.

Option 1 — pip (recommended for small/medium runs)
Bash
# Create a dedicated environment
conda create -n fastoma python=3.10 -y
conda activate fastoma

# Install FastOMA and dependencies
pip install FastOMA

# Verify installation
fastoma --version
Option 2 — conda
Bash
conda install -c bioconda fastoma -y
Option 3 — Docker (for reproducibility)
Bash
docker pull dessimozlab/fastoma:latest
docker run -v $(pwd):/data dessimozlab/fastoma:latest fastoma --help
Dependencies
Bash
# FastOMA requires these tools in PATH:
conda install -c bioconda diamond mmseqs2 mafft fasttree -y

Input Data

FastOMA needs two things:

  1. Protein FASTA files — one per species, containing all predicted protein sequences
  2. A species tree (Newick format) — optional but strongly recommended for accuracy; FastOMA can also infer a rough tree internally
Download example proteomes from UniProt
Bash
# Create project directory
mkdir fastoma_project && cd fastoma_project
mkdir proteomes

# Download 5 example proteomes (Arabidopsis, Rice, Maize, Sorghum, Tomato)
# Arabidopsis thaliana
wget -O proteomes/ARATH.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:3702+AND+reviewed:true"

# Oryza sativa (Rice)
wget -O proteomes/ORYSJ.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:39947+AND+reviewed:true"

# Or use any genome annotation protein output (e.g., MAKER .proteins.fasta)
# Naming convention: species_code.fa (important for FastOMA parsing)
Byte warning
File naming is critical: Each FASTA file must be named with a 5-letter species code (e.g., HUMAN.fa, MOUSE.fa). Sequence headers must start with >species_code|proteinID (e.g., >HUMAN|P53_HUMAN).
Format protein FASTA headers correctly
Bash
# If your headers are plain gene IDs, reformat them:
# Input: >AT1G01010.1 protein
# Output: >ARATH|AT1G01010.1

python3 - <<'EOF'
import sys, re

species_code = "ARATH"
with open("proteomes/arabidopsis_raw.fa") as f, open("proteomes/ARATH.fa", "w") as out:
    for line in f:
        if line.startswith(">"):
            gene_id = line.strip().split()[0].lstrip(">")
            out.write(f">{species_code}|{gene_id}\n")
        else:
            out.write(line)
print("Done reformatting headers")
EOF

1Prepare the Species Tree

A species tree in Newick format constrains HOG inference to biologically meaningful evolutionary relationships. You can use a published tree from TimeTree or NCBI taxonomy.

Bash
# Example species tree for 5 plant species (Newick format)
# Save as species_tree.nwk

cat > species_tree.nwk <<'EOF'
((((ARATH,LYCES),SOLLC),(ORYSJ,SORBI)),MAIZE);
EOF

# Verify tree format
python3 -c "
from ete3 import Tree
t = Tree('species_tree.nwk')
print('Species in tree:', [l.name for l in t.get_leaves()])
print('Tree looks valid!')
"
Byte tip
Tip: Get a dated species tree instantly from TimeTree.org — paste your species names and download Newick. For prokaryotes, use NCBI taxonomy or a 16S-based tree.

2Run FastOMA

FastOMA is run in a single command. It will perform all-vs-all protein comparisons, compute pairwise orthologs, then hierarchically infer HOGs.

Bash
# Basic run — FastOMA auto-detects all .fa files in the proteomes/ folder
fastoma \
  --proteomes proteomes/ \
  --species-tree species_tree.nwk \
  --output-folder fastoma_output/ \
  --threads 8

# With additional options for large datasets
fastoma \
  --proteomes proteomes/ \
  --species-tree species_tree.nwk \
  --output-folder fastoma_output/ \
  --threads 32 \
  --low-mem \
  --aligner diamond \     # faster than default for large proteomes
  --min-length 50 \       # filter short proteins
  --verbose
Expected runtime
Dataset sizeThreadsApprox. time
5–10 species85–15 minutes
50–100 species1630–90 minutes
500 species323–6 hours
1,000+ species6412–24 hours

3Understand Output Files

Bash
fastoma_output/
├── OrthologousGroups.tsv       # Pairwise orthologs (tab-separated)
├── HierarchicalGroups.orthoxml # HOGs in standard OrthoXML format
├── OrthologousMatrix.tsv       # Presence/absence matrix per HOG per species
├── speciestree_checked.nwk     # Validated/reconciled species tree
├── RootHOG_summary.tsv         # Summary stats per HOG at root level
└── logs/                       # Detailed run logs
Explore the orthologous groups file
Python
import pandas as pd

# Load pairwise orthologs
orthos = pd.read_csv("fastoma_output/OrthologousGroups.tsv", sep="\t",
                     names=["protein1","protein2","type","score"])

print(f"Total orthologous pairs: {len(orthos):,}")
print(orthos["type"].value_counts())
# type column values:
#   1-1: strict one-to-one orthologs (best for phylogenomics)
#   1-m: one-to-many (gene duplication in one lineage)
#   m-1: many-to-one
#   m-m: many-to-many (complex duplication history)

# Filter to 1-1 orthologs only
one_to_one = orthos[orthos["type"] == "1:1"]
print(f"\nOne-to-one orthologs: {len(one_to_one):,}")

# Example output:
# protein1           protein2          type  score
# ARATH|AT1G01010   ORYSJ|Os01g0100   1:1   187.3

4HOG Analysis with pyHam

pyHam is the official Python library for parsing and visualising HOG evolution from FastOMA/OMA output.

Bash
pip install pyham
Python
import pyham

# Load HOGs
ham = pyham.Ham(
    tree_file="fastoma_output/speciestree_checked.nwk",
    hog_file="fastoma_output/HierarchicalGroups.orthoxml",
    use_internal_name=True
)

# Get all HOGs at root level
all_hogs = ham.get_list_top_level_hogs()
print(f"Total root HOGs: {len(all_hogs)}")

# Analyse a single HOG
hog = all_hogs[0]
print(f"HOG ID: {hog.hog_id}")
print(f"Genes: {[g.unique_id for g in hog.get_all_descendant_genes()]}")

# Count single-copy conserved HOGs (useful for species tree inference)
single_copy = [h for h in all_hogs if h.is_single_copy()]
print(f"Single-copy conserved HOGs: {len(single_copy)}")

# Get duplication/loss events
for hog in all_hogs[:5]:
    events = ham.create_iHam(hog)
    print(f"HOG {hog.hog_id}: duplications={events.count_duplication()}, "
          f"losses={events.count_loss()}")
Presence/absence matrix — gene content evolution
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load orthologous matrix (HOGs × species)
mat = pd.read_csv("fastoma_output/OrthologousMatrix.tsv", sep="\t", index_col=0)

# Replace gene IDs with 1/0 presence-absence
pa_mat = mat.applymap(lambda x: 0 if pd.isna(x) or x == "" else 1)

# Plot heatmap of gene family presence/absence
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pa_mat.head(100),
            cmap="Blues", linewidths=0.3,
            cbar_kws={"label": "Gene present"},
            ax=ax)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("HOG (top 100)", fontsize=12)
ax.set_title("Gene family presence/absence across species", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("gene_content_heatmap.pdf", dpi=300, bbox_inches="tight")
plt.show()

5Build a Species Tree from Single-Copy Orthologs

Single-copy orthologs inferred by FastOMA are ideal markers for species tree inference using concatenation or coalescent methods.

Bash
# Extract single-copy HOG sequences for phylogenomics
python3 - <<'EOF'
import pyham, os

ham = pyham.Ham("fastoma_output/speciestree_checked.nwk",
                "fastoma_output/HierarchicalGroups.orthoxml",
                use_internal_name=True)

os.makedirs("single_copy_fastas", exist_ok=True)
count = 0
for hog in ham.get_list_top_level_hogs():
    if hog.is_single_copy() and len(hog.get_all_descendant_genes()) >= 4:
        genes = hog.get_all_descendant_genes()
        with open(f"single_copy_fastas/HOG{hog.hog_id}.fa", "w") as f:
            for g in genes:
                f.write(f">{g.genome.name}\n{g.sequence}\n")
        count += 1
        if count >= 200:   # use top 200 for concatenation
            break
print(f"Extracted {count} single-copy HOGs")
EOF

# Align each HOG with MAFFT
for f in single_copy_fastas/*.fa; do
    mafft --auto --quiet "$f" > "${f%.fa}_aln.fa"
done

# Concatenate alignments for IQ-TREE
python3 - <<'EOF'
from Bio import SeqIO, AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
import os, glob

alignments = sorted(glob.glob("single_copy_fastas/*_aln.fa"))
all_species = set()
for aln_file in alignments:
    for rec in SeqIO.parse(aln_file, "fasta"):
        all_species.add(rec.id)

concat = {sp: "" for sp in all_species}
for aln_file in alignments:
    aln = {r.id: str(r.seq) for r in SeqIO.parse(aln_file, "fasta")}
    aln_len = len(next(iter(aln.values())))
    for sp in all_species:
        concat[sp] += aln.get(sp, "-" * aln_len)

with open("concatenated_alignment.fa", "w") as f:
    for sp, seq in concat.items():
        f.write(f">{sp}\n{seq}\n")
print(f"Concatenated {len(alignments)} loci, {len(all_species)} species")
EOF

# Run IQ-TREE on concatenated alignment
iqtree2 -s concatenated_alignment.fa \
        -m TEST \
        -B 1000 \
        -T AUTO \
        --prefix species_tree_fastoma \
        --quiet

Visualise the Results

Plot the inferred species tree with bootstrap values
Python
from ete3 import Tree, TreeStyle, NodeStyle, TextFace
import matplotlib.pyplot as plt

t = Tree("species_tree_fastoma.treefile")

# Style the tree
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = True
ts.branch_vertical_margin = 15
ts.scale = 100

for node in t.traverse():
    ns = NodeStyle()
    if node.is_leaf():
        ns["size"] = 8
        ns["fgcolor"] = "#2196F3"
    else:
        ns["size"] = 5
        ns["fgcolor"] = "#FF5722"
    node.set_style(ns)

t.render("species_tree.pdf", tree_style=ts, w=200, units="mm")
print("Tree saved to species_tree.pdf")
Plot HOG duplication/loss statistics across lineages
Python
import matplotlib.pyplot as plt
import numpy as np

# Simulate stats from a real FastOMA run
species = ["ARATH","ORYSJ","SORBI","MAIZE","LYCES"]
duplications = [1243, 1876, 2341, 3102, 987]
losses = [456, 678, 890, 1203, 312]

x = np.arange(len(species))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x - width/2, duplications, width, label="Duplications", color="#4CAF50", alpha=0.85)
bars2 = ax.bar(x + width/2, losses, width, label="Losses", color="#F44336", alpha=0.85)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("Gene family events", fontsize=12)
ax.set_title("HOG duplication and loss events per lineage", fontsize=14, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(species, rotation=20)
ax.legend()
ax.bar_label(bars1, padding=3, fontsize=9)
ax.bar_label(bars2, padding=3, fontsize=9)
plt.tight_layout()
plt.savefig("hog_events.pdf", dpi=300, bbox_inches="tight")
plt.show()

FastOMA vs OrthoFinder vs OMA Standalone

FeatureFastOMAOrthoFinderOMA Standalone
PublicationNature Methods 2024Genome Biology 2019Bioinformatics 2019
Speed (1000 spp)HoursDaysWeeks
HOG inferenceYes (hierarchical)Yes (rooted)Yes (hierarchical)
OrthoXML outputYesNoYes
Species tree inputOptionalRequired for HOGsRequired
Quest for Orthologs accuracyHigh (comparable to OMA)HighHighest
Best forLarge-scale (100–10,000 spp)10–200 spp<100 spp, maximum accuracy

Summary

Byte summary
What you learned:
  • FastOMA infers orthologs and HOGs at scale — handling thousands of species in hours
  • Input: protein FASTAs with standardised species-code headers + optional species tree
  • Output: pairwise orthologs (TSV), HOGs (OrthoXML), presence/absence matrix
  • pyHam parses HOGs for downstream evolution analysis
  • Single-copy HOGs from FastOMA are ideal input for species tree inference with IQ-TREE
Cite this tool
Majidian S, Nevers Y, Yazdizadeh Kharrazi A, et al. Orthology inference at scale with FastOMA. Nature Methods. 2024. https://doi.org/10.1038/s41592-024-02552-8
Related Tutorials