Align sequences, select evolutionary models, build gene and species trees with IQ-TREE, RAxML-NG, and ASTRAL, and visualize them with ggtree and iTOL.
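Phylogenetics reconstructs evolutionary relationships from sequence data. The standard workflow is: align, trim, select a model, build the tree, visualize. The main tree-building methods compare as follows: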
| Method | Tool | Speed | When to Use |
|---|---|---|---|
| Maximum Likelihood | IQ-TREE, RAxML-NG | Medium | Standard choice for most analyses |
| Approximate ML | FastTree | Very Fast | >10,000 sequences, exploratory |
| Bayesian | MrBayes, BEAST2 | Slow | Divergence dating, posterior probabilities |
| Coalescent | ASTRAL | Fast | Species tree from gene trees (ILS) |
# MAFFT: fast and widely used; --auto picks a strategy based on dataset size
mafft --auto sequences.fasta > aligned.fasta

# MAFFT: high accuracy for <200 sequences
mafft --maxiterate 1000 --localpair sequences.fasta > aligned.fasta

# MUSCLE v5: newer, competitive accuracy
muscle -align sequences.fasta -output aligned.fasta

# PRANK: codon-aware alignment (best for Ka/Ks and dN/dS analyses)
prank -d=sequences.fasta -o=aligned -codon -F
Rule of thumb: <200 sequences? MAFFT --localpair. 200–10,000? MAFFT --auto. >10,000? MAFFT --retree 1 or MUSCLE. For coding sequences, use PRANK with -codon.
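The rule of thumb above can be written down as a small helper. This is only a sketch: the function names and the default file names are hypothetical, the thresholds are the ones quoted in the text, and the returned strings are the commands already shown in this section.

```python
def count_fasta_records(lines):
    """Count sequences by counting FASTA header lines (works on an open file too)."""
    return sum(1 for line in lines if line.startswith(">"))


def suggest_alignment_command(n_sequences, coding=False,
                              infile="sequences.fasta", outfile="aligned.fasta"):
    """Return the alignment command suggested by the rule of thumb (hypothetical helper)."""
    if coding:
        # PRANK writes to an output prefix (-o=) instead of stdout.
        return f"prank -d={infile} -o=aligned -codon -F"
    if n_sequences < 200:
        return f"mafft --maxiterate 1000 --localpair {infile} > {outfile}"
    if n_sequences <= 10_000:
        return f"mafft --auto {infile} > {outfile}"
    return f"mafft --retree 1 {infile} > {outfile}"


# Toy input: four lines of FASTA holding two sequences.
n = count_fasta_records([">seq1\n", "ACGT\n", ">seq2\n", "ACGA\n"])
command = suggest_alignment_command(n)
```

For a real file, `count_fasta_records(open("sequences.fasta"))` gives the count to pass in.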
# trimAl: automated trimming (recommended)
trimal -in aligned.fasta -out trimmed.fasta -automated1

# trimAl: gap threshold (remove columns with >50% gaps)
trimal -in aligned.fasta -out trimmed.fasta -gt 0.5

# Gblocks: stricter trimming (-t=p is for protein; use -t=d for DNA)
Gblocks aligned.fasta -t=p -b4=5 -b5=h -e=-gb

# ClipKIT: newer, smart trimming
clipkit aligned.fasta -o trimmed.fasta -m kpic-smart-gap
When in doubt, use trimal -automated1. If your alignment is already clean (e.g., single-copy orthologs), minimal trimming is fine. Always visually inspect the result in AliView or Jalview.
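To make the gap-threshold idea concrete, here is a minimal pure-Python sketch of "remove columns with more than 50% gaps". It illustrates the principle only and is not trimAl's implementation; the function name and the toy alignment are made up for the example.

```python
def trim_gappy_columns(sequences, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    sequences: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(sequences)
    rows = [sequences[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    n_rows = len(rows)
    # Keep a column when at most max_gap_fraction of its characters are gaps.
    keep = [
        col for col in range(len(rows[0]))
        if sum(row[col] == "-" for row in rows) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


# Toy alignment: the fourth column is a gap in 3 of 4 sequences (75%).
alignment = {
    "taxonA": "ATG-CA",
    "taxonB": "ATG-CA",
    "taxonC": "AT--CA",
    "taxonD": "ATGTCA",
}
trimmed = trim_gappy_columns(alignment)
# The fourth column is removed; the third (25% gaps) is kept.
```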
The substitution model describes how sequences evolve. A poorly fitting model can mislead the tree search and yield a wrong tree.
# IQ-TREE's ModelFinder (built-in, best option)
iqtree2 -s trimmed.fasta -m MFP   # MFP = ModelFinder Plus

# Standalone ModelTest-NG
modeltest-ng -i trimmed.fasta -t ml -p 8

# Output example:
#   Best model: GTR+F+I+G4 (DNA) or LG+F+G4 (protein)
#   BIC score: 45678.234
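The BIC score in that output balances fit against complexity: BIC = -2 lnL + k ln(n), where lnL is the log-likelihood, k the number of free parameters, and n the number of alignment sites; the model with the lowest BIC is preferred. A minimal sketch of the comparison, using made-up log-likelihoods and parameter counts rather than real program output:

```python
import math


def bic(log_likelihood, n_free_parameters, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_free_parameters * math.log(n_sites)


def best_model_by_bic(candidates, n_sites):
    """candidates: dict of model name -> (log-likelihood, free parameters)."""
    scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical values for a 1,000-site alignment (illustration only).
candidates = {
    "HKY+G": (-5230.0, 25),
    "GTR+G": (-5210.0, 29),
    "GTR+I+G": (-5209.5, 30),
}
best, scores = best_model_by_bic(candidates, n_sites=1000)
# Here GTR+G wins: GTR+I+G fits slightly better, but not by enough
# to pay for its extra parameter.
```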
| Data Type | Common Best Models | What They Model |
|---|---|---|
| DNA | GTR+G, GTR+I+G, HKY+G | Base frequencies, transition/transversion rates, rate variation |
| Protein | LG+G, WAG+G, JTT+G | Amino acid substitution matrices + rate heterogeneity |
| Codon | MG+F3X4, GY+F | Codon substitution rates (for selection analysis) |
"Gamma Is Great" — G+I+G
+G = gamma rate variation (different sites evolve at different rates). +I = invariable sites (a proportion of sites never change). Almost every real dataset needs +G. Let ModelFinder decide whether you also need +I.
IQ-TREE2 is the most versatile ML tree builder. Fast, accurate, and feature-rich.
# Standard analysis: model selection + tree + bootstrap
iqtree2 -s trimmed.fasta \
  -m MFP \
  -bb 1000 \
  -alrt 1000 \
  -nt AUTO \
  --prefix my_tree

# Partition analysis (different genes, different models)
# Create partition file: partition.nex
iqtree2 -s concatenated.fasta \
  -p partition.nex \
  -m MFP \
  -bb 1000 \
  -nt AUTO

# Constrained tree search (test topology)
iqtree2 -s trimmed.fasta -m MFP -g constraint_tree.nwk -bb 1000

# Key output files:
#   my_tree.treefile: Newick tree (open in FigTree/iTOL)
#   my_tree.iqtree: full report with model, scores, support
#   my_tree.contree: consensus tree with bootstrap values
| Parameter | What It Does | Recommendation |
|---|---|---|
| -bb 1000 | Ultrafast bootstrap (UFBoot2) | Always use. ≥95% = strong support. |
| -alrt 1000 | SH-aLRT test | Use alongside UFBoot. ≥80% = supported. |
| -m MFP | ModelFinder + tree building | Always let IQ-TREE choose the model. |
| -nt AUTO | Auto-detect CPU cores | Or specify: -nt 16 |
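When both -alrt and -bb are requested, IQ-TREE writes each internal node label as two numbers separated by a slash, SH-aLRT first and UFBoot second (for example `98.5/100`). The sketch below applies the two thresholds from the table to such a label; the function names are hypothetical.

```python
def parse_support_label(label):
    """Split an 'SH-aLRT/UFBoot' node label into two floats."""
    alrt_text, ufboot_text = label.split("/")
    return float(alrt_text), float(ufboot_text)


def is_well_supported(label, alrt_min=80.0, ufboot_min=95.0):
    """True when both values reach the thresholds given in the table."""
    alrt, ufboot = parse_support_label(label)
    return alrt >= alrt_min and ufboot >= ufboot_min


# Example labels as they might appear in a .treefile (values are illustrative).
for label in ["98.5/100", "75.2/96", "89.1/90"]:
    print(label, is_well_supported(label))
```

Only the first label passes: the second fails the SH-aLRT cutoff and the third fails the UFBoot cutoff.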
# Standard ML tree with bootstrap
raxml-ng --all \
--msa trimmed.fasta \
--model GTR+G \
--bs-trees 200 \
--threads auto{16} \
--prefix raxml_tree
# With a model selected beforehand (run modeltest-ng first and pass its best model)
raxml-ng --all --msa trimmed.fasta --model LG+G8+F --bs-trees 200
# Check alignment (sanity check before running)
raxml-ng --check --msa trimmed.fasta --model GTR+G
# DNA (GTR model)
FastTree -gtr -nt trimmed.fasta > tree.nwk

# Protein (default JTT+CAT model)
FastTree trimmed_protein.fasta > tree.nwk

# With -gamma: rescale branch lengths and report a Gamma20 likelihood
# (support values are SH-like local supports either way)
FastTree -gtr -nt -gamma trimmed.fasta > tree.nwk
# Create a NEXUS file or add MrBayes block to your alignment
begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=invgamma; [GTR+I+G]
prset brlenspr=unconstrained:exp(10.0);
mcmc ngen=5000000 samplefreq=1000
nchains=4 nruns=2
printfreq=10000;
sump burnin=1250;
sumt burnin=1250 contype=allcompat;
end;
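The burn-in value in the block above is a count of samples, not generations, so it has to be derived from ngen and samplefreq. A small sketch of that arithmetic, assuming the common 25% burn-in and ignoring the initial state sampled at generation 0:

```python
def mcmc_sample_counts(ngen, samplefreq, burnin_fraction=0.25):
    """Return (total samples, burn-in samples, samples kept) for one run."""
    total = ngen // samplefreq
    burnin = int(total * burnin_fraction)
    return total, burnin, total - burnin


total, burnin, kept = mcmc_sample_counts(ngen=5_000_000, samplefreq=1000)
print(total, burnin, kept)  # 5000 1250 3750
```

If you change ngen or samplefreq, recompute the burnin values passed to sump and sumt accordingly.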
Gene trees can conflict due to incomplete lineage sorting (ILS). ASTRAL estimates the species tree from many gene trees using the coalescent model.
# Step 1: Build gene trees for single-copy orthologs
# (From OrthoFinder single-copy orthologs)
mkdir -p aligned trimmed trees   # output directories must exist before the loop
for fasta in orthologs/*.fasta; do
  name=$(basename "$fasta" .fasta)
  mafft --auto "$fasta" > "aligned/${name}.aln"
  trimal -in "aligned/${name}.aln" -out "trimmed/${name}.trim" -automated1
  iqtree2 -s "trimmed/${name}.trim" -m MFP -bb 1000 -nt 2 --prefix "trees/${name}"
done
# Step 2: Concatenate all gene trees
cat trees/*.treefile > all_gene_trees.nwk
# Step 3: Run ASTRAL
java -jar astral.jar \
-i all_gene_trees.nwk \
-o species_tree.nwk \
2> astral.log
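# Step 4: Full branch annotation (-t 2: quartet support and local posterior
# probabilities for all three topologies around each branch; the default
# output in Step 3 already carries the local posterior probability)
java -jar astral.jar -i all_gene_trees.nwk -o species_tree_annotated.nwk -t 2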
"Gene trees ≠ Species tree" — GS conflict
Individual gene trees can disagree with the true species tree due to ILS, hybridization, or gene duplication/loss. ASTRAL handles ILS. For introgression, use PhyloNet or Dsuite.
# ggtree — ggplot2-based tree visualization
library(ggtree)
library(treeio)
tree <- read.tree("species_tree.nwk")
# Basic tree
ggtree(tree) +
geom_tiplab(size=3) +
geom_nodelab(aes(label=label), size=2, hjust=-0.1) +
theme_tree2()
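# Circular tree with bootstrap colors
# (for ASTRAL local posterior probabilities, which range from 0 to 1,
#  use a cutoff such as 0.95 instead of 95)
ggtree(tree, layout="circular") +
  geom_tiplab2(size=2.5) +
  geom_nodepoint(aes(color=as.numeric(label) >= 95), size=2) +
  scale_color_manual(values=c("grey70","red"), labels=c("<95%","≥95%")) +
  theme(legend.position="bottom")
# With heatmap annotation (gheatmap() is provided by ggtree itself)
# gene_count_matrix: data frame or matrix whose row names match the tip labels
p <- ggtree(tree) + geom_tiplab()
gheatmap(p, gene_count_matrix, colnames_angle=90, font.size=2)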
| Description | Platform |
|---|---|
| ggplot2 grammar for trees; most flexible | R |
| Web-based, drag-and-drop annotation | Web |
| Desktop viewer, easy node annotation | Desktop |
| Programmatic tree manipulation + viz | Python |
| Large trees, tanglegrams | Desktop |
| Simple Pythonic tree drawing | Python |

This is a warning, not an error. Constant sites are normal. If you're using +ASC (ascertainment bias correction for SNP data only), remove constant sites first.
Increase -bb from 1000 to 5000 or 10000. This happens with very short alignments or highly diverged sequences.
Fast-evolving taxa get artificially grouped together. Remedies: use better models (+G), remove long-branch taxa, use amino acids instead of DNA, or try Bayesian inference (less susceptible).
Gene tree estimation error or very deep ILS. Try: (1) better gene trees (more data, better models), (2) remove short genes (<500 bp), (3) collapse low-support branches in gene trees before ASTRAL.
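Remedy (2) is easy to automate before building gene trees. The sketch below keeps only genes whose alignment has at least 500 columns; it assumes aligned FASTA text is already in memory, measures length in alignment columns (gaps included), and uses hypothetical names and toy data throughout.

```python
def first_record_length(fasta_text):
    """Length of the first sequence in an aligned FASTA string."""
    length = 0
    in_first_record = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if in_first_record:
                break  # second header reached: first record is complete
            in_first_record = True
        elif in_first_record:
            length += len(line)
    return length


def keep_long_genes(alignments, min_length=500):
    """Return names of genes whose alignment has at least min_length columns."""
    return [name for name, text in alignments.items()
            if first_record_length(text) >= min_length]


# Hypothetical toy input: gene name -> aligned FASTA text.
alignments = {
    "geneA": ">sp1\n" + "A" * 620 + "\n>sp2\n" + "A" * 620 + "\n",
    "geneB": ">sp1\n" + "C" * 300 + "\n>sp2\n" + "C" * 300 + "\n",
}
long_genes = keep_long_genes(alignments)  # only geneA passes the 500-column cutoff
```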
| Description | Category |
|---|---|
| Best all-around ML (ModelFinder + UFBoot) | ML |
| Fast ML, excellent for large datasets | ML |
| Approximate ML for >10k sequences | Fast |
| Bayesian inference with MCMC | Bayesian |
| Bayesian + molecular clock dating | Dating |
| Species tree from gene trees (coalescent) | Species |
| Fastest, most versatile MSA tool | Alignment |
| Competitive MSA for proteins | Alignment |
| Automated alignment trimming | Trimming |
| Classic ML phylogenetics | ML |
| dN/dS, positive selection tests | Selection |
| Advanced selection analysis (MEME, BUSTED) | Selection |
| Divergence time estimation | Dating |
| PAML's Bayesian divergence dating | Dating |

"Align → Trim → Model → Tree → Viz" (ATMTV)
Never skip a step. Never build a tree without selecting a model. Never publish a tree without branch support values.
Bootstrap interpretation
UFBoot ≥95% = strong. 80–95% = moderate. <80% = weak. Standard bootstrap ≥70% is considered supported. Never confuse UFBoot with standard bootstrap — UFBoot thresholds are higher.
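Because the same number means different things under the two methods, it can help to make the method explicit whenever support values are interpreted. A minimal sketch using the thresholds quoted above; the function name and the "not supported" label for standard bootstrap values below 70% are assumptions of this example.

```python
def classify_support(value, method="ufboot"):
    """Map a support percentage to a verbal category for the given method."""
    if method == "ufboot":
        if value >= 95:
            return "strong"
        if value >= 80:
            return "moderate"
        return "weak"
    if method == "bootstrap":
        return "supported" if value >= 70 else "not supported"
    raise ValueError(f"unknown method: {method!r}")


# The same value reads differently depending on how it was computed.
print(classify_support(85, method="ufboot"))     # moderate
print(classify_support(85, method="bootstrap"))  # supported
```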
"IQ for Quality, Fast for First look"
Use IQ-TREE for your final publication tree. Use FastTree for quick exploratory looks at large datasets. Use Bayesian tools (MrBayes, BEAST2) when you need posterior probabilities or divergence time estimates.