Phylogenetics & Tree Building

Align sequences, select evolutionary models, build gene and species trees with IQ-TREE, RAxML-NG, and ASTRAL, and visualize them with ggtree and iTOL.

~70 min · Intermediate · Bash / R

1 Overview

Phylogenetics reconstructs evolutionary relationships. The standard workflow:

Phylogenetics Pipeline
  1. Collect orthologs — single-copy from OrthoFinder, or gene of interest
  2. Align — MAFFT, MUSCLE, or PRANK
  3. Trim — trimAl or Gblocks to remove poorly aligned regions
  4. Select model — ModelFinder (in IQ-TREE) or ProtTest
  5. Build tree — IQ-TREE (ML), RAxML-NG (ML), or MrBayes (Bayesian)
  6. Species tree — ASTRAL from gene trees
  7. Visualize — ggtree, iTOL, FigTree
Method              Tool               Speed      When to Use
Maximum Likelihood  IQ-TREE, RAxML-NG  Medium     Standard choice for most analyses
Approximate ML      FastTree           Very Fast  >10,000 sequences, exploratory
Bayesian            MrBayes, BEAST2    Slow       Divergence dating, posterior probabilities
Coalescent          ASTRAL             Fast       Species tree from gene trees (ILS)
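
Before starting the pipeline, it can help to confirm that the tools it relies on are actually installed. The sketch below is one possible pre-flight check; the function name check_tools is hypothetical, and the executable names in the usage line are the ones used in this tutorial, which may differ on your system (for example iqtree instead of iqtree2).

```shell
#!/usr/bin/env bash
# Pre-flight check: print one line per tool that is not on PATH.
# Returns 0 only when every tool was found.
# check_tools is a hypothetical helper, not part of any of these packages.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: check_tools mafft trimal iqtree2 raxml-ng || echo "install the tools listed above first"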

2 Multiple Sequence Alignment

Bash
# MAFFT — fastest, most used
mafft --auto sequences.fasta > aligned.fasta

# MAFFT — high accuracy for <200 sequences
mafft --maxiterate 1000 --localpair sequences.fasta > aligned.fasta

# MUSCLE v5 — newer, competitive accuracy
muscle -align sequences.fasta -output aligned.fasta

# PRANK — for codon-aware alignment (best for Ks/dN/dS)
prank -d=sequences.fasta -o=aligned -codon -F
Thumb Rule — Aligner Choice: <200 sequences? MAFFT --localpair. 200–10,000? MAFFT --auto. >10,000? MAFFT --retree 1 or MUSCLE. For coding sequences: PRANK with -codon.
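
The thumb rule above can be turned into a small helper that counts the FASTA records and prints matching MAFFT options. This is a sketch only: the function name mafft_opts_for is hypothetical, the cut-offs are simply the ones from the thumb rule, and it does not cover the PRANK codon case.

```shell
#!/usr/bin/env bash
# Print MAFFT options for a FASTA file, following the thumb rule:
#   <200 sequences   -> accurate mode (--localpair --maxiterate 1000)
#   200 to 10,000    -> --auto
#   >10,000          -> --retree 1
# mafft_opts_for is a hypothetical helper name.
mafft_opts_for() {
  local fasta=$1 n
  # grep -c exits non-zero when there are no matches, hence "|| true"
  n=$(grep -c '^>' "$fasta" || true)
  if [ "$n" -lt 200 ]; then
    echo "--localpair --maxiterate 1000"
  elif [ "$n" -le 10000 ]; then
    echo "--auto"
  else
    echo "--retree 1"
  fi
}
```

Usage: mafft $(mafft_opts_for sequences.fasta) sequences.fasta > aligned.fasta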

3 Alignment Trimming

Bash
# trimAl — automated trimming (recommended)
trimal -in aligned.fasta -out trimmed.fasta -automated1

# trimAl — gap threshold (remove columns with >50% gaps)
trimal -in aligned.fasta -out trimmed.fasta -gt 0.5

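# Gblocks — protein data (-t=p) with relaxed settings:
# -b4=5 allows shorter blocks, -b5=h keeps columns with gaps in up to half the sequences
Gblocks aligned.fasta -t=p -b4=5 -b5=h -e=-gb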

# ClipKIT — newer, smart trimming
clipkit aligned.fasta -o trimmed.fasta -m kpic-smart-gap
Byte warns
Don't over-trim! Aggressive trimming can remove phylogenetically informative sites. Start with trimal -automated1. If your alignment is already clean (e.g., single-copy orthologs), minimal trimming is fine. Always visually inspect in AliView or Jalview.
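
One quick way to spot over-trimming is to compare the number of alignment columns before and after. The sketch below assumes FASTA-formatted alignments (as produced by the commands above); the function names aln_columns and trim_retention are hypothetical, and what counts as "too much" trimming is left to your judgment.

```shell
#!/usr/bin/env bash
# Number of alignment columns, taken from the first record
# (works for FASTA with wrapped sequence lines).
aln_columns() {
  awk '/^>/ { if (seen) exit; seen = 1; next }
       { gsub(/[ \t\r]/, ""); len += length($0) }
       END { print len + 0 }' "$1"
}

# Percentage of columns kept after trimming (integer, rounded down).
trim_retention() {
  local before after
  before=$(aln_columns "$1")
  after=$(aln_columns "$2")
  if [ "$before" -le 0 ]; then
    echo "empty alignment: $1" >&2
    return 1
  fi
  echo $(( after * 100 / before ))
}
```

Usage: echo "kept $(trim_retention aligned.fasta trimmed.fasta)% of columns"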

4 Evolutionary Model Selection

The substitution model describes how sequences evolve. A poorly fitting model can bias branch lengths and, in difficult cases, the topology itself.

Bash
# IQ-TREE's ModelFinder (built-in, best option)
iqtree2 -s trimmed.fasta -m MFP   # MFP = ModelFinder Plus: selects the model, then builds the tree (-m MF selects only)

# Standalone ModelTest-NG
modeltest-ng -i trimmed.fasta -t ml -p 8

# Output example:
# Best model: GTR+F+I+G4 (DNA) or LG+F+G4 (protein)
# BIC score: 45678.234
Data Type  Common Best Models     What They Model
DNA        GTR+G, GTR+I+G, HKY+G  Base frequencies, transition/transversion rates, rate variation
Protein    LG+G, WAG+G, JTT+G     Amino acid substitution matrices + rate heterogeneity
Codon      MG+F3X4, GY+F          Codon substitution rates (for selection analysis)
Mnemonic

"Gamma Is Great" — G+I+G

+G = gamma rate variation (different sites evolve at different rates). +I = invariant sites (some sites never change). Almost every real dataset needs +G. Let ModelFinder decide if you also need +I.
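
To see what ModelFinder decided, you can pull the selected model out of the IQ-TREE report and check which rate terms it contains. This sketch assumes the .iqtree report contains a line of the form "Best-fit model according to BIC: ..." (the criterion name may differ); best_model and rate_terms are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the model named on the "Best-fit model according to ..." line
# of an IQ-TREE report (line format is an assumption; check your own report).
best_model() {
  sed -n 's/^Best-fit model according to [A-Za-z]*: *//p' "$1" | head -n 1
}

# Report which rate-heterogeneity terms a model string contains.
rate_terms() {
  local out=""
  case "$1" in *+I*) out="+I" ;; esac
  case "$1" in *+G*) out="${out:+$out }+G" ;; esac
  echo "${out:-none}"
}
```

Usage: rate_terms "$(best_model trimmed.fasta.iqtree)"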

5 IQ-TREE — Maximum Likelihood

IQ-TREE2 is the most versatile ML tree builder. Fast, accurate, and feature-rich.

Bash
# Standard analysis: model selection + tree + bootstrap
iqtree2 -s trimmed.fasta \
  -m MFP \
  -bb 1000 \
  -alrt 1000 \
  -nt AUTO \
  --prefix my_tree

# Partition analysis (different genes, different models)
# Create partition file: partition.nex
iqtree2 -s concatenated.fasta \
  -p partition.nex \
  -m MFP \
  -bb 1000 \
  -nt AUTO

# Constrained tree search (test topology)
iqtree2 -s trimmed.fasta -m MFP -g constraint_tree.nwk -bb 1000

# Key output files:
# my_tree.treefile — ML tree in Newick format with support values (open in FigTree/iTOL)
# my_tree.iqtree   — main report: selected model, likelihood scores, tree with support
# my_tree.contree  — consensus tree with bootstrap values
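
The partition analysis above needs a partition.nex file that tells IQ-TREE where each gene sits in the concatenated alignment. The sketch below writes a minimal NEXUS sets block from gene names and lengths given in concatenation order. The function name write_partition_nexus and the gene names are hypothetical; real coordinates must match how your alignment was concatenated.

```shell
#!/usr/bin/env bash
# Write a minimal NEXUS "sets" block with one charset per gene.
# Arguments: name:length pairs, in the order the genes were concatenated.
# write_partition_nexus is a hypothetical helper name.
write_partition_nexus() {
  local start=1 end pair name len
  echo "#nexus"
  echo "begin sets;"
  for pair in "$@"; do
    name=${pair%%:*}          # text before the colon
    len=${pair##*:}           # text after the colon
    end=$(( start + len - 1 ))
    echo "  charset $name = $start-$end;"
    start=$(( end + 1 ))
  done
  echo "end;"
}
```

Usage: write_partition_nexus geneA:1200 geneB:834 > partition.nex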
Parameter   What It Does                   Recommendation
-bb 1000    Ultrafast bootstrap (UFBoot2)  Always use. ≥95% = strong support.
-alrt 1000  SH-aLRT test                   Use alongside UFBoot. ≥80% = supported.
-m MFP      ModelFinder + tree building    Always let IQ-TREE choose the model.
-nt AUTO    Auto-detect CPU cores          Or specify: -nt 16
Thumb Rule — Branch Support: UFBoot ≥95% AND SH-aLRT ≥80% = strongly supported node. If only one metric passes, the node is weakly supported. Report both values in publications.
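
The two-metric rule above can be applied to a whole tree with a few lines of shell. This sketch assumes that, when both -alrt and -bb are used, internal node labels in the tree file look like "SH-aLRT/UFBoot" (for example 98.5/100); verify this against your own .treefile. The function names are hypothetical, and the "unsupported" label for nodes that fail both thresholds is an addition for illustration.

```shell
#!/usr/bin/env bash
# Classify one node from its SH-aLRT and UFBoot values (both in percent).
classify_support() {
  awk -v alrt="$1" -v ufb="$2" 'BEGIN {
    a = (alrt + 0 >= 80); u = (ufb + 0 >= 95)
    if (a && u)      print "strong"
    else if (a || u) print "weak"
    else             print "unsupported"   # label added for this sketch
  }'
}

# Count nodes per class in a Newick file with "SH-aLRT/UFBoot" node labels.
summarize_tree_support() {
  grep -o ')[0-9.]*/[0-9.]*' "$1" | tr -d ')' |
    while IFS=/ read -r alrt ufb; do
      classify_support "$alrt" "$ufb"
    done | sort | uniq -c
}
```

Usage: summarize_tree_support my_tree.treefile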

6 RAxML-NG

Bash
# Standard ML tree with bootstrap
raxml-ng --all \
  --msa trimmed.fasta \
  --model GTR+G \
  --bs-trees 200 \
  --threads auto{16} \
  --prefix raxml_tree

# RAxML-NG does not select a model itself: run modeltest-ng first,
# then pass the chosen model (protein example)
raxml-ng --all --msa trimmed.fasta --model LG+G8+F --bs-trees 200

# Check alignment (sanity check before running)
raxml-ng --check --msa trimmed.fasta --model GTR+G

7 FastTree — Quick Approximate Trees

Bash
# DNA (GTR model)
FastTree -gtr -nt trimmed.fasta > tree.nwk

# Protein (default JTT+CAT model)
FastTree trimmed_protein.fasta > tree.nwk

# -gamma rescales branch lengths and reports a Gamma20-based likelihood;
# support values are still SH-like local supports
FastTree -gtr -nt -gamma trimmed.fasta > tree.nwk
Byte warns
FastTree caveats: Support values from FastTree are NOT bootstrap values — they're approximate SH-like local support. They tend to be inflated. For publications, use IQ-TREE or RAxML-NG with proper bootstrapping. FastTree is for exploring large datasets (>5,000 tips) or quick checks.

8 MrBayes — Bayesian Inference

NEXUS
[ Create a NEXUS file or append this MrBayes block to your alignment file ]
begin mrbayes;
  set autoclose=yes nowarn=yes;
  lset nst=6 rates=invgamma;    [GTR+I+G]
  prset brlenspr=unconstrained:exp(10.0);
  mcmc ngen=5000000 samplefreq=1000
       nchains=4 nruns=2
       printfreq=10000;
  sump burnin=1250;
  sumt burnin=1250 contype=allcompat;
end;
Thumb Rule — Convergence: Posterior probability ≥0.95 = strong support. Check convergence: PSRF (potential scale reduction factor) should be ~1.0, and the average standard deviation of split frequencies should be <0.01.
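
The burnin=1250 value in the MrBayes block above follows from the run settings: 5,000,000 generations sampled every 1,000 generations give 5,000 samples, and discarding the first 25% removes 1,250 of them. A small helper (hypothetical name burnin_samples) makes it easy to recompute this when you change ngen or samplefreq; the 25% default is a common convention, not a requirement.

```shell
#!/usr/bin/env bash
# Number of samples to discard as burn-in.
# Arguments: ngen, samplefreq, burn-in percentage (default 25).
# burnin_samples is a hypothetical helper name.
burnin_samples() {
  local ngen=$1 samplefreq=$2 percent=${3:-25}
  echo $(( ngen / samplefreq * percent / 100 ))
}

# burnin_samples 5000000 1000 25   # prints 1250
```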

9 Species Trees with ASTRAL

Gene trees can conflict due to incomplete lineage sorting (ILS). ASTRAL is a summary method: it estimates the species tree from many gene trees and is statistically consistent under the multispecies coalescent model.

Bash
# Step 1: Build gene trees for single-copy orthologs
# (From OrthoFinder single-copy orthologs)
mkdir -p aligned trimmed trees   # output directories must exist before the loop
for fasta in orthologs/*.fasta; do
  name=$(basename "$fasta" .fasta)
  mafft --auto "$fasta" > "aligned/${name}.aln"
  trimal -in "aligned/${name}.aln" -out "trimmed/${name}.trim" -automated1
  iqtree2 -s "trimmed/${name}.trim" -m MFP -bb 1000 -nt 2 --prefix "trees/${name}"
done

# Step 2: Concatenate all gene trees
cat trees/*.treefile > all_gene_trees.nwk

# Step 3: Run ASTRAL
java -jar astral.jar \
  -i all_gene_trees.nwk \
  -o species_tree.nwk \
  2> astral.log

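# Step 4: Full branch annotation (-t 2): quartet support plus local posterior
# probabilities for the main and alternative topologies.
# The default (-t 3) already labels branches with the local posterior probability.
java -jar astral.jar -i all_gene_trees.nwk -o species_tree_annotated.nwk -t 2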
Mnemonic

"Gene trees ≠ Species tree" — GS conflict

Individual gene trees can disagree with the true species tree due to ILS, hybridization, or gene duplication/loss. ASTRAL handles ILS. For introgression, use PhyloNet or Dsuite.

10 Tree Visualization

R
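# ggtree — ggplot2-based tree visualization
library(ggtree)
library(treeio)

# ASTRAL species tree; its node labels are local posterior probabilities (0-1)
tree <- read.tree("species_tree.nwk")

# Basic tree
ggtree(tree) +
  geom_tiplab(size=3) +
  geom_nodelab(aes(label=label), size=2, hjust=-0.1) +
  theme_tree2()

# Circular tree with bootstrap colors
# Use an IQ-TREE tree here, because UFBoot values are percentages (0-100).
# Assumption: my_tree.contree carries plain UFBoot values as node labels.
ml_tree <- read.tree("my_tree.contree")
ggtree(ml_tree, layout="circular") +
  geom_tiplab2(size=2.5) +
  geom_nodepoint(aes(color=suppressWarnings(as.numeric(label)) >= 95), size=2) +
  scale_color_manual(values=c("FALSE"="grey70", "TRUE"="red"),
                     labels=c("FALSE"="<95%", "TRUE"="≥95%"),
                     na.translate=FALSE, name="UFBoot") +
  theme(legend.position="bottom")

# With heatmap annotation (gheatmap() is part of ggtree itself)
# gene_count_matrix: a data frame or matrix whose row names match the tip labels
p <- ggtree(tree) + geom_tiplab()
gheatmap(p, gene_count_matrix, colnames_angle=90, font.size=2)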
Tool              Description                               Type
ggtree (R)        ggplot2 grammar for trees, most flexible  R
iTOL              Web-based, drag-and-drop annotation       Web
FigTree           Desktop viewer, easy node annotation      Desktop
ETE3 (Python)     Programmatic tree manipulation + viz      Python
Dendroscope       Large trees, tanglegrams                  Desktop
toytree (Python)  Simple Pythonic tree drawing              Python

11 AI Prompt Guide

Phylogenetic Analysis for Your Data
I have [NUMBER] [DNA / protein] sequences from [ORGANISM GROUP]. They are [single-gene / multi-gene from OrthoFinder / whole-genome]. I need a [gene tree / species tree / dated tree]. Number of taxa (tips): [NUMBER]. Please write a pipeline that:
  1. Aligns with MAFFT (appropriate mode for my sequence count)
  2. Trims with trimAl
  3. Selects the best model with ModelFinder
  4. Builds an ML tree with IQ-TREE (UFBoot + SH-aLRT)
  5. [If species tree: builds individual gene trees then runs ASTRAL]
  6. Visualizes with ggtree in R, coloring branches by bootstrap support
  7. Includes a SLURM wrapper for HPC

12 Common Errors

IQ-TREE: "Alignment has X constant sites"

This is a warning, not an error. Constant sites are normal. If you're using +ASC (ascertainment bias correction for SNP data only), remove constant sites first.

IQ-TREE: "WARNING: bootstrap analysis did not converge"

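Raise the maximum number of search iterations with -nm (default 1000), for example -nm 5000; the warning message itself points to this option. Adding more replicates with -bb does not by itself fix convergence. The warning typically appears with very short alignments or highly diverged sequences.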

Long branch attraction (LBA)

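Fast-evolving taxa are grouped together artificially. Remedies: use models that account for rate variation (+G), add taxa that break up long branches, remove long-branch taxa, use amino acids instead of DNA, or try Bayesian inference with a site-heterogeneous model (e.g., CAT in PhyloBayes), which is often reported to be less susceptible.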

ASTRAL: Very low quartet support

Gene tree estimation error or very deep ILS. Try: (1) better gene trees (more data, better models), (2) remove short genes (<500 bp), (3) collapse low-support branches in gene trees before ASTRAL.
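
Remedy (2) above, dropping short genes, can be applied before the gene-tree step with a simple length filter. This is a sketch under two assumptions: the alignments are nucleotide FASTA files (so columns correspond to bp), and the file layout in the usage line is the one from the ASTRAL section. The function name keep_long_alignments is hypothetical.

```shell
#!/usr/bin/env bash
# Print only those FASTA alignments with at least min_cols columns.
# Arguments: min_cols, then one or more alignment files.
# keep_long_alignments is a hypothetical helper name.
keep_long_alignments() {
  local min_cols=$1 f cols
  shift
  for f in "$@"; do
    # Column count = length of the first record (handles wrapped lines)
    cols=$(awk '/^>/ { if (seen) exit; seen = 1; next }
                { gsub(/[ \t\r]/, ""); n += length($0) }
                END { print n + 0 }' "$f")
    if [ "$cols" -ge "$min_cols" ]; then
      echo "$f"
    fi
  done
  return 0
}
```

Usage: keep_long_alignments 500 trimmed/*.trim > genes_to_use.txt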

13 All Phylogenetics Tools

Tool           Description                                 Category
IQ-TREE2       Best all-around ML (ModelFinder + UFBoot)   ML
RAxML-NG       Fast ML, excellent for large datasets       ML
FastTree       Approximate ML for >10k sequences           Fast
MrBayes        Bayesian inference with MCMC                Bayesian
BEAST2         Bayesian + molecular clock dating           Dating
ASTRAL         Species tree from gene trees (coalescent)   Species
MAFFT          Fastest, most versatile MSA tool            Alignment
MUSCLE v5      Competitive MSA for proteins                Alignment
trimAl         Automated alignment trimming                Trimming
PhyML          Classic ML phylogenetics                    ML
PAML (codeml)  dN/dS, positive selection tests             Selection
HyPhy          Advanced selection analysis (MEME, BUSTED)  Selection
r8s / treePL   Divergence time estimation                  Dating
MCMCtree       PAML's Bayesian divergence dating           Dating

Mnemonics & Thumb Rules

Mnemonic

"Align → Trim → Model → Tree → Viz" — ATMTV

Never skip a step. Never build a tree without selecting a model. Never publish a tree without bootstrap support.

Thumb Rule

Bootstrap interpretation

UFBoot ≥95% = strong. 80–95% = moderate. <80% = weak. Standard bootstrap ≥70% is considered supported. Never confuse UFBoot with standard bootstrap — UFBoot thresholds are higher.

Mnemonic

"IQ for Quality, Fast for First look"

Use IQ-TREE for your final publication tree. Use FastTree for quick exploratory looks at large datasets. Use MrBayes when you need posterior probabilities or divergence time estimates.

Thumb Rule — Minimum taxa: You need at least 4 taxa for a meaningful unrooted tree. For robust phylogenetics: 20+ taxa recommended. For species tree with ASTRAL: 50+ gene trees recommended.
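
The gene-tree recommendation above is easy to check before running ASTRAL. The sketch below counts trees in a file with one Newick tree per line (as produced by concatenating IQ-TREE .treefile outputs) and compares the count with a minimum. The function name check_gene_tree_count is hypothetical, and the default of 50 is simply the figure from the thumb rule.

```shell
#!/usr/bin/env bash
# Count Newick trees (one per line, each ending in ";") and compare
# against a minimum. Arguments: tree file, minimum count (default 50).
# check_gene_tree_count is a hypothetical helper name.
check_gene_tree_count() {
  local n min=${2:-50}
  # grep -c exits non-zero when nothing matches, hence "|| true"
  n=$(grep -c ';' "$1" || true)
  if [ "$n" -lt "$min" ]; then
    echo "only $n gene trees (recommended: $min+)"
    return 1
  fi
  echo "$n gene trees"
}
```

Usage: check_gene_tree_count all_gene_trees.nwk && java -jar astral.jar -i all_gene_trees.nwk -o species_tree.nwk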

Summary

Byte
You can now:
  1. Align sequences with MAFFT and trim with trimAl
  2. Select the best substitution model with ModelFinder
  3. Build ML trees with IQ-TREE and RAxML-NG
  4. Run Bayesian analysis with MrBayes
  5. Estimate species trees from gene trees with ASTRAL
  6. Create beautiful tree figures with ggtree and iTOL