Substitution Models & Model Adequacy · Maximum Likelihood at Scale · Bayesian Inference & MCMC Diagnostics
GTR+Γ assumes a single substitution matrix governs all sites. In real proteins, hydrophobic core sites, surface residues, and active sites experience wildly different selective constraints — they don't share one matrix. For deep divergences (>500 Mya), this model violation causes systematic error (long branch attraction, compositional artifacts) that more data cannot overcome.
IQ-TREE implements the C10–C60 profile mixture models and the empirical PMSF approach, which approximates a full CAT analysis at a fraction of the computational cost.
# Step 1: Run a guide tree with a standard model iqtree2 -s alignment.phy -m LG+G -T AUTO --prefix guide # Step 2: Run C60 (60-profile mixture) with the guide tree iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \ -T AUTO -bb 1000 --prefix c60_tree # Step 3: PMSF — Posterior Mean Site Frequencies (fastest site-het) # Uses the guide tree to estimate per-site amino acid frequencies iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \ -T AUTO -bb 1000 --prefix pmsf_tree # Compare BIC scores between LG+G and LG+C60+F+G grep "BIC" guide.iqtree pmsf_tree.iqtree
| Model | Profiles | Speed | When to Use |
|---|---|---|---|
LG+C10+F+G | 10 | Fast | Quick exploratory runs |
LG+C20+F+G | 20 | Moderate | Good balance of speed/accuracy |
LG+C60+F+G | 60 | Slow | Publication-quality deep phylogenomics |
LG+C60+F+G -ft | 60 (PMSF) | Fast* | PMSF approximation — best tradeoff |
PhyloBayes implements the infinite mixture CAT model — each site is drawn from an infinite number of possible amino acid frequency profiles. This is the most realistic model for deep-time phylogenomics but is computationally very expensive.
# Run two independent chains (ALWAYS run at least 2) pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain1 & pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain2 & # Monitor convergence (run periodically) # tracecomp compares parameter traces between chains tracecomp -x 5000 chain1 chain2 # Look for: maxdiff < 0.1 (ideal < 0.3) # effsize > 100 for all parameters # bpcomp compares bipartition (topology) frequencies bpcomp -x 5000 chain1 chain2 # Look for: maxdiff < 0.1 # Generate consensus tree (after convergence) # Burn-in first 5000 generations, sample every 10th bpcomp -x 5000 10 chain1 chain2
How do you know your model fits the data? Posterior predictive simulation generates fake datasets under your model and compares statistics to the real data. Large discrepancies = model violation.
# PhyloBayes: posterior predictive tests # After chain has converged: ppred -comp -x 5000 10 chain1 # Tests compositional homogeneity # z-score > 2 → model is violated for composition ppred -div -x 5000 10 chain1 # Tests site-specific biochemical diversity # z-score > 2 → model underestimates site heterogeneity ppred -stat -x 5000 10 chain1 # Multiple summary statistics simultaneously
Heterotachy = the rate at a site changes across lineages over time (a site that's fast in clade A becomes slow in clade B). Standard models assume constant rates per site.
# IQ-TREE: Heterotachy-aware models # GHOST model (General Heterogeneous evolution On a Single Topology) iqtree2 -s alignment.phy -m "LG+FO*H4" -T AUTO -bb 1000 # H2, H3, H4 = number of rate classes that shift across the tree # More classes = more parameters = slower # Covarion model (sites switch between "on" and "off") # In MrBayes: # lset nst=6 rates=invgamma covarion=yes;
"Sites Aren't Static" — SAS
In deep phylogenomics, sites change their evolutionary rate across the tree (heterotachy). A conserved site in vertebrates can be variable in insects. Standard models miss this entirely. Use GHOST (IQ-TREE) or covarion (MrBayes) when you suspect heterotachy.
Phylogenomic datasets often contain hundreds of loci with different evolutionary rates. Partitioned models assign each locus (or codon position) its own substitution parameters while sharing the topology.
| Partition Model | What's Shared | What's Separate | IQ-TREE Flag |
|---|---|---|---|
| Edge-linked proportional | Tree topology + relative branch lengths | Rate multiplier per partition | -p (default) |
| Edge-unlinked | Tree topology only | Branch lengths per partition | -Q |
| Edge-linked equal | Topology + branch lengths | Substitution model only | -q |
# Create a partition file (NEXUS format)
cat << 'EOF' > partitions.nex
#nexus
begin sets;
charset gene1 = alignment.phy: 1-450;
charset gene2 = alignment.phy: 451-900;
charset gene3 = alignment.phy: 901-1350;
charset gene4_codon1 = alignment.phy: 1351-1800\3;
charset gene4_codon2 = alignment.phy: 1352-1800\3;
charset gene4_codon3 = alignment.phy: 1353-1800\3;
charpartition mine = HKY+G:gene1, GTR+G:gene2, TVM+G:gene3,
GTR+G:gene4_codon1, HKY+G:gene4_codon2,
GTR+G:gene4_codon3;
end;
EOF
# Edge-linked proportional (standard — recommended for most cases)
iqtree2 -s alignment.phy -p partitions.nex -m MFP+MERGE \
-T AUTO -bb 1000 --prefix part_linked
# Edge-unlinked (each partition gets its own branch lengths)
# Use for highly heterogeneous rate datasets
iqtree2 -s alignment.phy -Q partitions.nex -m MFP+MERGE \
-T AUTO -bb 1000 --prefix part_unlinked
# MFP+MERGE: ModelFinder + partition merging (PartitionFinder-like)
# Merges partitions with similar rates to reduce over-parameterization
# For VERY large datasets (>1000 loci):
iqtree2 -S partition_dir/ -m MFP --prefix loci \
-T AUTO -bb 1000
# -S: each file in directory = one partition
# Partitioned analysis with RAxML-NG
raxml-ng --all --msa concat.phy --model partitions.txt \
--tree pars{10},rand{10} --bs-trees 200 \
--threads auto{64} --prefix raxml_part
# partitions.txt format:
# GTR+G, gene1 = 1-450
# GTR+G, gene2 = 451-900
# LG+G+F, protein1 = 1-300
# Checkpoint & recovery (crucial for HPC wall-time limits)
# If job gets killed, just re-run the same command:
raxml-ng --all --msa concat.phy --model partitions.txt \
--tree pars{10} --bs-trees 200 --threads auto{64} \
--prefix raxml_part --redo
# Check: is your dataset too large for memory?
raxml-ng --parse --msa concat.phy --model partitions.txt
# Prints estimated memory requirement
Test alternative topologies or enforce known relationships.
# IQ-TREE: backbone constraint # Force a monophyletic clade while letting other taxa move freely echo "((TaxonA,TaxonB,TaxonC),(TaxonD,TaxonE));" > constraint.tre iqtree2 -s alignment.phy -m MFP -g constraint.tre -bb 1000 # RAxML-NG: constrained search raxml-ng --all --msa alignment.phy --model GTR+G \ --tree-constraint constraint.tre --bs-trees 200
#!/bin/bash
#SBATCH --job-name=iqtree_phylogenomics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem=128G
#SBATCH --time=7-00:00:00
#SBATCH --output=iqtree_%j.log
module load iqtree/2.2.6
# Safe mode: if job is killed, IQ-TREE resumes from checkpoint
iqtree2 -s phylogenomic_alignment.phy \
-p partitions.nex \
-m MFP+MERGE \
-bb 1000 -alrt 1000 \
-T ${SLURM_CPUS_PER_TASK} \
--prefix phylo_result \
-redo # remove -redo for checkpoint resume
-p or -Q). Even better: run individual gene trees and use ASTRAL for the species tree (Module 1.5).
"P for Proportional, Q for Quite different"
In IQ-TREE: -p = edge-linked proportional (branch lengths scale by partition rate). -Q = edge-unlinked (each partition has completely independent branch lengths). Use -Q when partitions have very different rates (e.g., coding vs. noncoding).
In Bayesian phylogenetics, Bayes factors compare models. They require the marginal likelihood — the probability of the data integrated over all parameters and trees. The two gold-standard methods:
| Method | Accuracy | Speed | Tool |
|---|---|---|---|
| Stepping-stone sampling | Excellent | Slow (doubles runtime) | MrBayes, BEAST2 |
| Path sampling | Excellent | Slow | BEAST2 |
| Harmonic mean | Poor (biased!) | Fast | MrBayes (legacy) |
| AICM | Approximate | Fast | Tracer |
begin mrbayes; set autoclose=yes nowarn=yes; [Model specification] lset nst=6 rates=invgamma; prset brlenspr=unconstrained:exp(10.0); prset shapepr=exponential(1.0); [Standard MCMC first — to get a starting tree] mcmc ngen=5000000 samplefreq=1000 nchains=4 nruns=2; sump; sumt; [Stepping-stone sampling for marginal likelihood] ss ngen=1000000 diagnfreq=5000 nsteps=50 alpha=0.4; [Compare models using 2*ln(BF)] [BF > 10 = strong evidence for better model] [BF > 100 = decisive] end;
The harmonic mean estimator is severely biased and unreliable. It was the default in older MrBayes versions. Always use stepping-stone or path sampling. If a paper reports harmonic mean Bayes factors, treat the results with skepticism.
Instead of picking one model, reversible-jump (rjMCMC) explores the model space during the MCMC run, effectively averaging over all models proportional to their posterior probability.
# In MrBayes: enable reversible-jump across all GTR submodels lset nst=mixed rates=invgamma; # This uses rjMCMC to sample across JC, K2P, HKY, SYM, GTR etc. # The posterior probability of each model is reported in sump output mcmc ngen=10000000 samplefreq=1000 nchains=4 nruns=2; sump burnin=2500; # Check: "Model indicator" in sump output shows model posterior probs
Tracer is a good first look, but it misses multimodal posteriors and chain mixing issues. Use additional diagnostics:
library(rwty)
# Load MrBayes output (both runs)
my_chains <- load.multi("mrbayes_output", format="mb")
# 1. Trace plots — look for stationarity and mixing
makeplot.trace(my_chains, burnin=25)
# 2. ESS (Effective Sample Size) — want > 200 for all parameters
makeplot.ess(my_chains, burnin=25)
# 3. Topology trace — are chains exploring the same tree space?
makeplot.topology(my_chains, burnin=25)
# 4. ASDSF (Average Standard Deviation of Split Frequencies)
# The gold standard for topology convergence
# Want < 0.01 (MrBayes reports this automatically)
# 5. Compare posterior distributions between chains
makeplot.pairs(my_chains, burnin=25)
# 6. Autocorrelation — chains should mix well
makeplot.autocorr(my_chains, burnin=25)
library(coda)
# Read parameter log as MCMC object
params <- read.table("chain1.p", header=TRUE, skip=1)
mc <- mcmc(params[-(1:2500),]) # remove burn-in
# Gelman-Rubin diagnostic (need 2+ chains)
params2 <- read.table("chain2.p", header=TRUE, skip=1)
mc2 <- mcmc(params2[-(1:2500),])
mc_list <- mcmc.list(mc, mc2)
gelman.diag(mc_list)
# PSRF (Potential Scale Reduction Factor) should be ~1.0
# Upper 95% CI should be < 1.05
# Geweke diagnostic (single chain stationarity)
geweke.diag(mc)
# |z| < 2 for converged parameters
# Raftery-Lewis diagnostic (how long to run)
raftery.diag(mc, q=0.025, r=0.005, s=0.95)
# Tells you minimum chain length needed
nchains=8), increase temperature spacing (temp=0.2), or run much longer.
# In BEAUti: set up your analysis normally, then # Analysis → Use Path Sampling/Stepping Stone # Or add to XML: # <run spec="beast.inference.PathSampler" # nrOfSteps="100" chainLength="1000000" # alpha="0.3" rootdir="/path/to/output" # burnInPercentage="50"> # ... (your standard MCMC spec goes here) # </run> # After running, check output for: # "marginal likelihood (path sampling) = -XXXXX.XX" # Compare between models: 2*ln(BF) > 10 = strong preference
CAT-GTR is computationally brutal. For >500 taxa or >50,000 sites, consider: (1) Reducing alignment by removing constant sites, (2) Using PMSF in IQ-TREE as a fast approximation, (3) Using PhyloBayes-MPI with 64+ cores, (4) Running on a subset of the most informative loci.
C60 with many taxa needs substantial RAM. Try: (1) C20 instead of C60 (much less RAM), (2) The two-step PMSF approach (guide tree first, then C60 on the fixed tree), (3) --safe flag for more conservative memory usage.
The chains are stuck in different topology islands. Solutions: (1) Increase nchains to 8+ and temp to 0.1–0.2 for more aggressive heating, (2) Run 4+ independent analyses and check for consistent topologies, (3) Consider whether the data genuinely supports multiple topologies (use topology testing — Section 1.7).
This is actually informative, not a bug. If GTR+G and CAT-GTR disagree, the CAT-GTR topology is usually more reliable for deep divergences. The disagreement itself suggests model misspecification under GTR+G. Report both results and discuss model sensitivity in your paper.
"Deep Divergence Demands Diversity" — 4D Rule
Deep divergences demand site-heterogeneous models (diverse profiles). If your divergences are >500 Mya, GTR+G is not enough. Use C60/PMSF/CAT.
"Two chains, Two hundred ESS, point-One ASDSF"
The three numbers for MCMC convergence: ≥2 independent runs, ESS ≥200 for all parameters, ASDSF <0.01 for topology. Miss any one = don't publish.
Model complexity ladder
Start simple (GTR+G), check adequacy (posterior predictive), escalate if violated (C60 → CAT-GTR). Don't default to the most complex model — over-parameterization wastes time and can reduce accuracy on small datasets.