Infer orthologs and hierarchical orthologous groups (HOGs) across thousands of species — orders of magnitude faster than traditional OMA, with accuracy benchmarked by the Quest for Orthologs consortium.
Orthology is the relationship between genes in different species that descended from a common ancestral gene by speciation (not gene duplication). Identifying orthologs is foundational for:

- transferring functional annotations between genomes
- selecting single-copy marker genes for species tree inference
- comparative genomics of gene family gain, loss, and expansion
FastOMA (Majidian et al., Nature Methods 2024) is a scalable reimplementation of the OMA (Orthologous MAtrix) algorithm. It uses a hierarchical approach — leveraging the OMA knowledge base and species tree topology — to infer orthologous groups orders of magnitude faster than the original, enabling genome-scale analyses across thousands of species.
| Term | Definition | Origin event |
|---|---|---|
| Orthologs | Same gene, different species, separated by speciation | Speciation |
| Paralogs | Gene copies within or across species from duplication | Gene duplication |
| Homeologs | Genes from polyploidization events (subgenomes) | Whole genome duplication |
HOGs are a richer representation than pairwise orthologs. A HOG at a given taxonomic level (e.g., Vertebrata) groups all genes that descended from a single ancestral gene at that level. This captures the full duplication/loss history of a gene family.
FastOMA outputs HOGs in OrthoXML format, the standard exchange format used by orthology resources such as OMA and PANTHER. The output can be parsed with the pyHam Python library for downstream HOG analysis.
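To make the format concrete, here is a minimal hand-written OrthoXML sketch (illustrative only, not actual FastOMA output, and abbreviated — `geneRef` ids 2 and 3 would be declared under their own `species` blocks). The key idea is that a `paralogGroup` nested inside an `orthologGroup` encodes a duplication event:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="example" originVersion="1">
  <species name="ARATH" NCBITaxId="3702">
    <database name="example_db" version="1">
      <genes>
        <gene id="1" protId="ARATH|AT1G01010.1"/>
      </genes>
    </database>
  </species>
  <!-- one <species> block per genome, then the groups -->
  <groups>
    <orthologGroup id="HOG:0000001">
      <geneRef id="1"/>
      <!-- a nested paralogGroup marks a duplication event -->
      <paralogGroup>
        <geneRef id="2"/>
        <geneRef id="3"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
```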
FastOMA is installable via pip or conda. A Nextflow pipeline is also available for large-scale runs.
```bash
# Create a dedicated environment
conda create -n fastoma python=3.10 -y
conda activate fastoma

# Install FastOMA and dependencies
pip install FastOMA

# Verify installation
fastoma --version
```
```bash
conda install -c bioconda fastoma -y
```
```bash
docker pull dessimozlab/fastoma:latest
docker run -v $(pwd):/data dessimozlab/fastoma:latest fastoma --help
```
```bash
# FastOMA requires these tools in PATH:
conda install -c bioconda diamond mmseqs2 mafft fasttree -y
```
FastOMA needs two inputs: a folder of proteome FASTA files (one per species) and a species tree in Newick format.
```bash
# Create project directory
mkdir fastoma_project && cd fastoma_project
mkdir proteomes

# Download example proteomes from UniProt (reviewed entries only);
# shown for Arabidopsis and rice — repeat for maize, sorghum, and tomato
# Arabidopsis thaliana
wget -O proteomes/ARATH.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:3702+AND+reviewed:true"
# Oryza sativa subsp. japonica (rice)
wget -O proteomes/ORYSJ.fa "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:39947+AND+reviewed:true"

# Or use any genome annotation protein output (e.g., MAKER .proteins.fasta)
# Naming convention: species_code.fa (important for FastOMA parsing)
```
Each proteome file must be named by its species code (e.g., HUMAN.fa, MOUSE.fa). Sequence headers must start with >species_code|proteinID (e.g., >HUMAN|P53_HUMAN).
```bash
# If your headers are plain gene IDs, reformat them:
# Input:  >AT1G01010.1 protein
# Output: >ARATH|AT1G01010.1
python3 - <<'EOF'
species_code = "ARATH"
with open("proteomes/arabidopsis_raw.fa") as f, open("proteomes/ARATH.fa", "w") as out:
    for line in f:
        if line.startswith(">"):
            gene_id = line.strip().split()[0].lstrip(">")
            out.write(f">{species_code}|{gene_id}\n")
        else:
            out.write(line)
print("Done reformatting headers")
EOF
```
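Before running anything downstream, it is worth verifying that every header actually matches the `>species_code|proteinID` pattern. A small sketch (the regex and the toy headers below are illustrative; adjust the pattern if your species codes differ):

```python
import re

# Headers FastOMA expects look like ">SPECIESCODE|proteinID"
HEADER_RE = re.compile(r"^>[A-Z0-9]{3,10}\|\S+$")

def check_headers(lines):
    """Return the header lines that do NOT follow the expected pattern."""
    return [ln for ln in lines
            if ln.startswith(">") and not HEADER_RE.match(ln.strip())]

# Toy example — in practice, iterate over every file in proteomes/
good = [">ARATH|AT1G01010.1", "MSTNKQ", ">ARATH|AT1G01020.1"]
bad = [">AT1G01010.1 protein"]
print(check_headers(good))  # []
print(check_headers(bad))
```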
A species tree in Newick format constrains HOG inference to biologically meaningful evolutionary relationships. You can use a published tree from TimeTree or NCBI taxonomy.
```bash
# Example species tree for 5 plant species (Newick format)
# Save as species_tree.nwk
cat > species_tree.nwk <<'EOF'
((ARATH,SOLLC),(ORYSJ,(SORBI,MAIZE)));
EOF

# Verify tree format
python3 -c "
from ete3 import Tree
t = Tree('species_tree.nwk')
print('Species in tree:', [l.name for l in t.get_leaves()])
print('Tree looks valid!')
"
```
FastOMA runs as a single command: it performs all-vs-all protein comparisons, computes pairwise orthologs, then hierarchically infers HOGs along the species tree.
```bash
# Basic run — FastOMA auto-detects all .fa files in the proteomes/ folder
fastoma \
    --proteomes proteomes/ \
    --species-tree species_tree.nwk \
    --output-folder fastoma_output/ \
    --threads 8

# With additional options for large datasets:
#   --aligner diamond  (faster than the default for large proteomes)
#   --min-length 50    (filter out short proteins)
fastoma \
    --proteomes proteomes/ \
    --species-tree species_tree.nwk \
    --output-folder fastoma_output/ \
    --threads 32 \
    --low-mem \
    --aligner diamond \
    --min-length 50 \
    --verbose
```
| Dataset size | Threads | Approx. time |
|---|---|---|
| 5–10 species | 8 | 5–15 minutes |
| 50–100 species | 16 | 30–90 minutes |
| 500 species | 32 | 3–6 hours |
| 1,000+ species | 64 | 12–24 hours |
```
fastoma_output/
├── OrthologousGroups.tsv        # Pairwise orthologs (tab-separated)
├── HierarchicalGroups.orthoxml  # HOGs in standard OrthoXML format
├── OrthologousMatrix.tsv        # Presence/absence matrix per HOG per species
├── speciestree_checked.nwk      # Validated/reconciled species tree
├── RootHOG_summary.tsv          # Summary stats per HOG at root level
└── logs/                        # Detailed run logs
```
```python
import pandas as pd

# Load pairwise orthologs
orthos = pd.read_csv("fastoma_output/OrthologousGroups.tsv", sep="\t",
                     names=["protein1", "protein2", "type", "score"])
print(f"Total orthologous pairs: {len(orthos):,}")
print(orthos["type"].value_counts())

# type column values:
#   1:1 — strict one-to-one orthologs (best for phylogenomics)
#   1:m — one-to-many (gene duplication in one lineage)
#   m:1 — many-to-one
#   m:m — many-to-many (complex duplication history)

# Filter to 1:1 orthologs only
one_to_one = orthos[orthos["type"] == "1:1"]
print(f"\nOne-to-one orthologs: {len(one_to_one):,}")

# Example output:
#   protein1          protein2         type  score
#   ARATH|AT1G01010   ORYSJ|Os01g0100  1:1   187.3
```
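Since the species code is embedded in each protein ID before the pipe, it takes only a few lines to count how many 1:1 orthologs each species pair shares — a quick sanity check on the run. A sketch on a toy table in the same shape as `OrthologousGroups.tsv` (the IDs are made up):

```python
import pandas as pd

# Toy rows mimicking OrthologousGroups.tsv
orthos = pd.DataFrame({
    "protein1": ["ARATH|g1", "ARATH|g2", "ORYSJ|g9"],
    "protein2": ["ORYSJ|o1", "SORBI|s2", "SORBI|s3"],
    "type":     ["1:1",      "1:1",      "1:m"],
})

# Species code is the part of the ID before the pipe
orthos["sp1"] = orthos["protein1"].str.split("|").str[0]
orthos["sp2"] = orthos["protein2"].str.split("|").str[0]

# How many 1:1 orthologs does each species pair share?
pair_counts = (orthos[orthos["type"] == "1:1"]
               .groupby(["sp1", "sp2"]).size())
print(pair_counts)
```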
pyHam is the official Python library for parsing and visualising HOG evolution from FastOMA/OMA output.
```bash
pip install pyham
```
```python
import pyham

# Load HOGs
ham = pyham.Ham(
    tree_file="fastoma_output/speciestree_checked.nwk",
    hog_file="fastoma_output/HierarchicalGroups.orthoxml",
    use_internal_name=True
)

# Get all HOGs at the root level
all_hogs = ham.get_list_top_level_hogs()
print(f"Total root HOGs: {len(all_hogs)}")

# Inspect a single HOG
hog = all_hogs[0]
print(f"HOG ID: {hog.hog_id}")
print(f"Genes: {[g.unique_id for g in hog.get_all_descendant_genes()]}")

# Count single-copy conserved HOGs (useful for species tree inference):
# exactly one gene per species, present in every species
n_species = len(ham.get_list_extant_genomes())

def is_single_copy(h):
    genes = h.get_all_descendant_genes()
    return (len(genes) == n_species
            and len({g.genome.name for g in genes}) == n_species)

single_copy = [h for h in all_hogs if is_single_copy(h)]
print(f"Single-copy conserved HOGs: {len(single_copy)}")

# Export an interactive iHam visualisation (HTML) of a HOG's
# duplication/loss history for inspection in a browser
html = ham.create_iHam(hog=all_hogs[0])
with open("hog0_iham.html", "w") as f:
    f.write(html)
```
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load orthologous matrix (HOGs × species)
mat = pd.read_csv("fastoma_output/OrthologousMatrix.tsv", sep="\t", index_col=0)

# Convert gene IDs to a 1/0 presence–absence matrix
pa_mat = (mat.notna() & (mat != "")).astype(int)

# Plot heatmap of gene family presence/absence
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pa_mat.head(100),
            cmap="Blues", linewidths=0.3,
            cbar_kws={"label": "Gene present"},
            ax=ax)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("HOG (top 100)", fontsize=12)
ax.set_title("Gene family presence/absence across species", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("gene_content_heatmap.pdf", dpi=300, bbox_inches="tight")
plt.show()
```
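The same presence–absence matrix also supports a pangenome-style summary: core HOGs (in every species), cloud HOGs (species-specific), and the shell in between. A sketch on a toy matrix (the categories and thresholds are a common convention, not FastOMA terminology):

```python
import pandas as pd

# Toy presence/absence matrix: rows = HOGs, columns = species
pa = pd.DataFrame(
    {"ARATH": [1, 1, 0, 1],
     "ORYSJ": [1, 1, 1, 0],
     "SORBI": [1, 0, 1, 0]},
    index=["HOG1", "HOG2", "HOG3", "HOG4"])

n_species = pa.shape[1]
counts = pa.sum(axis=1)          # species per HOG

core = (counts == n_species).sum()   # present in every species
cloud = (counts == 1).sum()          # species-specific
shell = len(pa) - core - cloud       # everything in between
print(f"core={core}, shell={shell}, cloud={cloud}")
```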
Single-copy orthologs inferred by FastOMA are ideal markers for species tree inference using concatenation or coalescent methods.
```bash
# Extract single-copy HOG sequences for phylogenomics
python3 - <<'EOF'
import glob, os
import pyham

ham = pyham.Ham(tree_file="fastoma_output/speciestree_checked.nwk",
                hog_file="fastoma_output/HierarchicalGroups.orthoxml",
                use_internal_name=True)

# OrthoXML stores protein IDs, not sequences — map IDs back to the
# input FASTAs (indexed under both the full header token and the bare
# gene ID, since the exact protId format can vary)
seqs = {}
for fa in glob.glob("proteomes/*.fa"):
    cur = None
    for line in open(fa):
        if line.startswith(">"):
            tok = line[1:].split()[0]
            cur = []
            seqs[tok] = cur
            seqs[tok.split("|")[-1]] = cur
        else:
            cur.append(line.strip())
seqs = {h: "".join(s) for h, s in seqs.items()}

n_species = len(ham.get_list_extant_genomes())
os.makedirs("single_copy_fastas", exist_ok=True)
count = 0
for hog in ham.get_list_top_level_hogs():
    genes = hog.get_all_descendant_genes()
    # single copy: exactly one gene per species, in every species
    if len(genes) != n_species or len({g.genome.name for g in genes}) != n_species:
        continue
    with open(f"single_copy_fastas/HOG{hog.hog_id}.fa", "w") as f:
        for g in genes:
            f.write(f">{g.genome.name}\n{seqs[g.prot_id]}\n")
    count += 1
    if count >= 200:  # use up to 200 loci for concatenation
        break
print(f"Extracted {count} single-copy HOGs")
EOF
```
```bash
# Align each HOG with MAFFT
for f in single_copy_fastas/*.fa; do
    mafft --auto --quiet "$f" > "${f%.fa}_aln.fa"
done
```
```bash
# Concatenate alignments for IQ-TREE
python3 - <<'EOF'
import glob
from Bio import SeqIO

alignments = sorted(glob.glob("single_copy_fastas/*_aln.fa"))

# Collect the full species set across all loci
all_species = set()
for aln_file in alignments:
    for rec in SeqIO.parse(aln_file, "fasta"):
        all_species.add(rec.id)

# Concatenate, padding species missing from a locus with gaps
concat = {sp: "" for sp in all_species}
for aln_file in alignments:
    aln = {r.id: str(r.seq) for r in SeqIO.parse(aln_file, "fasta")}
    aln_len = len(next(iter(aln.values())))
    for sp in all_species:
        concat[sp] += aln.get(sp, "-" * aln_len)

with open("concatenated_alignment.fa", "w") as f:
    for sp, seq in concat.items():
        f.write(f">{sp}\n{seq}\n")
print(f"Concatenated {len(alignments)} loci, {len(all_species)} species")
EOF
```
```bash
# Run IQ-TREE on the concatenated alignment
iqtree2 -s concatenated_alignment.fa \
    -m TEST \
    -B 1000 \
    -T AUTO \
    --prefix species_tree_fastoma \
    --quiet
```
```python
from ete3 import Tree, TreeStyle, NodeStyle

t = Tree("species_tree_fastoma.treefile")

# Style the tree
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = True
ts.branch_vertical_margin = 15
ts.scale = 100

for node in t.traverse():
    ns = NodeStyle()
    if node.is_leaf():
        ns["size"] = 8
        ns["fgcolor"] = "#2196F3"
    else:
        ns["size"] = 5
        ns["fgcolor"] = "#FF5722"
    node.set_style(ns)

t.render("species_tree.pdf", tree_style=ts, w=200, units="mm")
print("Tree saved to species_tree.pdf")
```
```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative event counts — replace with values parsed from your own run
species = ["ARATH", "ORYSJ", "SORBI", "MAIZE", "SOLLC"]
duplications = [1243, 1876, 2341, 3102, 987]
losses = [456, 678, 890, 1203, 312]

x = np.arange(len(species))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x - width/2, duplications, width, label="Duplications", color="#4CAF50", alpha=0.85)
bars2 = ax.bar(x + width/2, losses, width, label="Losses", color="#F44336", alpha=0.85)
ax.set_xlabel("Species", fontsize=12)
ax.set_ylabel("Gene family events", fontsize=12)
ax.set_title("HOG duplication and loss events per lineage", fontsize=14, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(species, rotation=20)
ax.legend()
ax.bar_label(bars1, padding=3, fontsize=9)
ax.bar_label(bars2, padding=3, fontsize=9)
plt.tight_layout()
plt.savefig("hog_events.pdf", dpi=300, bbox_inches="tight")
plt.show()
```
| Feature | FastOMA | OrthoFinder | OMA Standalone |
|---|---|---|---|
| Publication | Nature Methods 2024 | Genome Biology 2019 | Genome Research 2019 |
| Speed (1000 spp) | Hours | Days | Weeks |
| HOG inference | Yes (hierarchical) | Yes (rooted) | Yes (hierarchical) |
| OrthoXML output | Yes | No | Yes |
| Species tree input | Optional | Required for HOGs | Required |
| Quest for Orthologs accuracy | High (comparable to OMA) | High | Highest |
| Best for | Large-scale (100–10,000 spp) | 10–200 spp | <100 spp, maximum accuracy |