Phylogenetic Inference — Deep Dive Module 1

Substitution Models & Model Adequacy · Maximum Likelihood at Scale · Bayesian Inference & MCMC Diagnostics

~120 min Advanced Bash / R / Python
Byte

1.1 — Substitution Models & Model Adequacy

Why Standard GTR+G Can Mislead in Deep Phylogenomics

GTR+Γ assumes a single substitution matrix governs all sites. In real proteins, hydrophobic core sites, surface residues, and active sites experience wildly different selective constraints — they don't share one matrix. For deep divergences (>500 Mya), this model violation causes systematic error (long branch attraction, compositional artifacts) that more data cannot overcome.

Hierarchy of Substitution Models (Increasing Realism)
  1. JC/K2P/HKY — toy models, useful only for teaching
  2. GTR+Γ+I — standard empirical model, adequate for moderate divergences
  3. Empirical mixture (C10–C60) — IQ-TREE's pre-computed profiles
  4. PMSF (Posterior Mean Site Frequencies) — fast site-heterogeneous approximation
  5. CAT / CAT-GTR (PhyloBayes) — infinite mixture, gold standard for deep phylogenomics
  6. Covarion / heterotachy — rates at sites change across the tree

Site-Heterogeneous Models in IQ-TREE

IQ-TREE implements the C10–C60 profile mixture models and the empirical PMSF approach, which approximates a full CAT analysis at a fraction of the computational cost.

Bash
# Step 1: Run a guide tree with a standard model
iqtree2 -s alignment.phy -m LG+G -T AUTO --prefix guide

# Step 2: Run C60 (60-profile mixture) with the guide tree
iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \
  -T AUTO -bb 1000 --prefix c60_tree

# Step 3: PMSF — Posterior Mean Site Frequencies (fastest site-het)
# Uses the guide tree to estimate per-site amino acid frequencies
iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \
  -T AUTO -bb 1000 --prefix pmsf_tree

# Compare BIC scores between LG+G and LG+C60+F+G
grep "BIC" guide.iqtree pmsf_tree.iqtree
Download:
ModelProfilesSpeedWhen to Use
LG+C10+F+G10FastQuick exploratory runs
LG+C20+F+G20ModerateGood balance of speed/accuracy
LG+C60+F+G60SlowPublication-quality deep phylogenomics
LG+C60+F+G -ft60 (PMSF)Fast*PMSF approximation — best tradeoff

CAT / CAT-GTR in PhyloBayes

PhyloBayes implements the infinite mixture CAT model — each site is drawn from an infinite number of possible amino acid frequency profiles. This is the most realistic model for deep-time phylogenomics but is computationally very expensive.

Bash
# Run two independent chains (ALWAYS run at least 2)
pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain1 &
pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain2 &

# Monitor convergence (run periodically)
# tracecomp compares parameter traces between chains
tracecomp -x 5000 chain1 chain2
# Look for: maxdiff < 0.1 (ideal < 0.3)
# effsize > 100 for all parameters

# bpcomp compares bipartition (topology) frequencies
bpcomp -x 5000 chain1 chain2
# Look for: maxdiff < 0.1

# Generate consensus tree (after convergence)
# Burn-in first 5000 generations, sample every 10th
bpcomp -x 5000 10 chain1 chain2
Download:
Thumb Rule — When to Use CAT-GTR: If your alignment spans >500 Mya divergences (e.g., animal phyla, plant deep nodes), standard GTR+G is inadequate. Use at minimum C60/PMSF in IQ-TREE, or CAT-GTR in PhyloBayes. For shallower divergences (<100 Mya), GTR+G is usually fine.

Posterior Predictive Simulation for Model Adequacy

How do you know your model fits the data? Posterior predictive simulation generates fake datasets under your model and compares statistics to the real data. Large discrepancies = model violation.

Bash
# PhyloBayes: posterior predictive tests
# After chain has converged:
ppred -comp -x 5000 10 chain1
# Tests compositional homogeneity
# z-score > 2 → model is violated for composition

ppred -div -x 5000 10 chain1
# Tests site-specific biochemical diversity
# z-score > 2 → model underestimates site heterogeneity

ppred -stat -x 5000 10 chain1
# Multiple summary statistics simultaneously

Heterotachy & Covarion Models

Heterotachy = the rate at a site changes across lineages over time (a site that's fast in clade A becomes slow in clade B). Standard models assume constant rates per site.

Bash
# IQ-TREE: Heterotachy-aware models
# GHOST model (General Heterogeneous evolution On a Single Topology)
iqtree2 -s alignment.phy -m "LG+FO*H4" -T AUTO -bb 1000

# H2, H3, H4 = number of rate classes that shift across the tree
# More classes = more parameters = slower

# Covarion model (sites switch between "on" and "off")
# In MrBayes:
# lset nst=6 rates=invgamma covarion=yes;
Mnemonic

"Sites Aren't Static" — SAS

In deep phylogenomics, sites change their evolutionary rate across the tree (heterotachy). A conserved site in vertebrates can be variable in insects. Standard models miss this entirely. Use GHOST (IQ-TREE) or covarion (MrBayes) when you suspect heterotachy.

1.2 — Maximum Likelihood at Scale

Partitioned Analysis: Edge-Linked vs Edge-Unlinked

Phylogenomic datasets often contain hundreds of loci with different evolutionary rates. Partitioned models assign each locus (or codon position) its own substitution parameters while sharing the topology.

Partition ModelWhat's SharedWhat's SeparateIQ-TREE Flag
Edge-linked proportionalTree topology + relative branch lengthsRate multiplier per partition-p (default)
Edge-unlinkedTree topology onlyBranch lengths per partition-Q
Edge-linked equalTopology + branch lengthsSubstitution model only-q
Bash — IQ-TREE Partitioned
# Create a partition file (NEXUS format)
cat << 'EOF' > partitions.nex
#nexus
begin sets;
  charset gene1 = alignment.phy: 1-450;
  charset gene2 = alignment.phy: 451-900;
  charset gene3 = alignment.phy: 901-1350;
  charset gene4_codon1 = alignment.phy: 1351-1800\3;
  charset gene4_codon2 = alignment.phy: 1352-1800\3;
  charset gene4_codon3 = alignment.phy: 1353-1800\3;
  charpartition mine = HKY+G:gene1, GTR+G:gene2, TVM+G:gene3,
                        GTR+G:gene4_codon1, HKY+G:gene4_codon2,
                        GTR+G:gene4_codon3;
end;
EOF

# Edge-linked proportional (standard — recommended for most cases)
iqtree2 -s alignment.phy -p partitions.nex -m MFP+MERGE \
  -T AUTO -bb 1000 --prefix part_linked

# Edge-unlinked (each partition gets its own branch lengths)
# Use for highly heterogeneous rate datasets
iqtree2 -s alignment.phy -Q partitions.nex -m MFP+MERGE \
  -T AUTO -bb 1000 --prefix part_unlinked

# MFP+MERGE: ModelFinder + partition merging (PartitionFinder-like)
# Merges partitions with similar rates to reduce over-parameterization

# For VERY large datasets (>1000 loci):
iqtree2 -S partition_dir/ -m MFP --prefix loci \
  -T AUTO -bb 1000
# -S: each file in directory = one partition
Download:

RAxML-NG for Massive Datasets

Bash
# Partitioned analysis with RAxML-NG
raxml-ng --all --msa concat.phy --model partitions.txt \
  --tree pars{10},rand{10} --bs-trees 200 \
  --threads auto{64} --prefix raxml_part

# partitions.txt format:
# GTR+G, gene1 = 1-450
# GTR+G, gene2 = 451-900
# LG+G+F, protein1 = 1-300

# Checkpoint & recovery (crucial for HPC wall-time limits)
# If job gets killed, just re-run the same command:
raxml-ng --all --msa concat.phy --model partitions.txt \
  --tree pars{10} --bs-trees 200 --threads auto{64} \
  --prefix raxml_part --redo

# Check: is your dataset too large for memory?
raxml-ng --parse --msa concat.phy --model partitions.txt
# Prints estimated memory requirement
Download:

Constrained Tree Searches

Test alternative topologies or enforce known relationships.

Bash
# IQ-TREE: backbone constraint
# Force a monophyletic clade while letting other taxa move freely
echo "((TaxonA,TaxonB,TaxonC),(TaxonD,TaxonE));" > constraint.tre
iqtree2 -s alignment.phy -m MFP -g constraint.tre -bb 1000

# RAxML-NG: constrained search
raxml-ng --all --msa alignment.phy --model GTR+G \
  --tree-constraint constraint.tre --bs-trees 200

HPC Strategies for Large Phylogenomics

Bash — SLURM Script
#!/bin/bash
#SBATCH --job-name=iqtree_phylogenomics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem=128G
#SBATCH --time=7-00:00:00
#SBATCH --output=iqtree_%j.log

module load iqtree/2.2.6

# Safe mode: if job is killed, IQ-TREE resumes from checkpoint
iqtree2 -s phylogenomic_alignment.phy \
  -p partitions.nex \
  -m MFP+MERGE \
  -bb 1000 -alrt 1000 \
  -T ${SLURM_CPUS_PER_TASK} \
  --prefix phylo_result \
  -redo  # remove -redo for checkpoint resume
Download:
Thumb Rule — Concatenation Pitfall: Concatenating 500 loci and running a single GTR+G model is wrong. Always use partitioned analysis (-p or -Q). Even better: run individual gene trees and use ASTRAL for the species tree (Module 1.5).
Mnemonic

"P for Proportional, Q for Quite different"

In IQ-TREE: -p = edge-linked proportional (branch lengths scale by partition rate). -Q = edge-unlinked (each partition has completely independent branch lengths). Use -Q when partitions have very different rates (e.g., coding vs. noncoding).

1.3 — Bayesian Inference & MCMC Diagnostics

Marginal Likelihood Estimation for Model Comparison

In Bayesian phylogenetics, Bayes factors compare models. They require the marginal likelihood — the probability of the data integrated over all parameters and trees. The two gold-standard methods:

MethodAccuracySpeedTool
Stepping-stone samplingExcellentSlow (doubles runtime)MrBayes, BEAST2
Path samplingExcellentSlowBEAST2
Harmonic meanPoor (biased!)FastMrBayes (legacy)
AICMApproximateFastTracer
NEXUS — MrBayes Stepping-Stone
begin mrbayes;
  set autoclose=yes nowarn=yes;
  
  [Model specification]
  lset nst=6 rates=invgamma;
  prset brlenspr=unconstrained:exp(10.0);
  prset shapepr=exponential(1.0);
  
  [Standard MCMC first — to get a starting tree]
  mcmc ngen=5000000 samplefreq=1000 nchains=4 nruns=2;
  sump;
  sumt;
  
  [Stepping-stone sampling for marginal likelihood]
  ss ngen=1000000 diagnfreq=5000 nsteps=50 alpha=0.4;
  
  [Compare models using 2*ln(BF)]
  [BF > 10 = strong evidence for better model]
  [BF > 100 = decisive]
end;
Download:
Never Use Harmonic Mean for Model Comparison

The harmonic mean estimator is severely biased and unreliable. It was the default in older MrBayes versions. Always use stepping-stone or path sampling. If a paper reports harmonic mean Bayes factors, treat the results with skepticism.

Reversible Jump MCMC for Model Averaging

Instead of picking one model, reversible-jump (rjMCMC) explores the model space during the MCMC run, effectively averaging over all models proportional to their posterior probability.

NEXUS — MrBayes Model Averaging
# In MrBayes: enable reversible-jump across all GTR submodels
lset nst=mixed rates=invgamma;
# This uses rjMCMC to sample across JC, K2P, HKY, SYM, GTR etc.
# The posterior probability of each model is reported in sump output

mcmc ngen=10000000 samplefreq=1000 nchains=4 nruns=2;
sump burnin=2500;
# Check: "Model indicator" in sump output shows model posterior probs

MCMC Convergence Diagnostics — Beyond Tracer

Tracer is a good first look, but it misses multimodal posteriors and chain mixing issues. Use additional diagnostics:

R — RWTY Convergence Diagnostics
library(rwty)

# Load MrBayes output (both runs)
my_chains <- load.multi("mrbayes_output", format="mb")

# 1. Trace plots — look for stationarity and mixing
makeplot.trace(my_chains, burnin=25)

# 2. ESS (Effective Sample Size) — want > 200 for all parameters
makeplot.ess(my_chains, burnin=25)

# 3. Topology trace — are chains exploring the same tree space?
makeplot.topology(my_chains, burnin=25)

# 4. ASDSF (Average Standard Deviation of Split Frequencies)
# The gold standard for topology convergence
# Want < 0.01 (MrBayes reports this automatically)

# 5. Compare posterior distributions between chains
makeplot.pairs(my_chains, burnin=25)

# 6. Autocorrelation — chains should mix well
makeplot.autocorr(my_chains, burnin=25)
Download:
R — CODA Package Diagnostics
library(coda)

# Read parameter log as MCMC object
params <- read.table("chain1.p", header=TRUE, skip=1)
mc <- mcmc(params[-(1:2500),])  # remove burn-in

# Gelman-Rubin diagnostic (need 2+ chains)
params2 <- read.table("chain2.p", header=TRUE, skip=1)
mc2 <- mcmc(params2[-(1:2500),])
mc_list <- mcmc.list(mc, mc2)

gelman.diag(mc_list)
# PSRF (Potential Scale Reduction Factor) should be ~1.0
# Upper 95% CI should be < 1.05

# Geweke diagnostic (single chain stationarity)
geweke.diag(mc)
# |z| < 2 for converged parameters

# Raftery-Lewis diagnostic (how long to run)
raftery.diag(mc, q=0.025, r=0.005, s=0.95)
# Tells you minimum chain length needed
Download:

Dealing with Multimodal Posteriors

Byte warns
Multimodal posteriors are the hidden danger of Bayesian phylogenetics. Your chains may converge to different tree topologies and both look well-mixed. Signs: (1) Two runs give different consensus trees. (2) ASDSF stays high (>0.05) even after millions of generations. (3) Topology trace shows chains jumping between islands. Solutions: Run many (4+) independent chains, use Metropolis-coupled MCMC (MC3) with more heated chains (nchains=8), increase temperature spacing (temp=0.2), or run much longer.
MCMC Convergence Checklist
  • ☑ ESS > 200 for all parameters (Tracer)
  • ☑ PSRF ~ 1.0 for all parameters (Gelman-Rubin / CODA)
  • ☑ ASDSF < 0.01 for topology (MrBayes / RWTY)
  • ☑ Traces show stationarity (no trend) after burn-in
  • ☑ Two independent runs give same topology and parameter estimates
  • ☑ Posterior predictive checks pass (PhyloBayes)

BEAST2 Marginal Likelihood (Path Sampling)

XML snippet — BEAST2
# In BEAUti: set up your analysis normally, then
# Analysis → Use Path Sampling/Stepping Stone
# Or add to XML:
# <run spec="beast.inference.PathSampler"
#   nrOfSteps="100" chainLength="1000000"
#   alpha="0.3" rootdir="/path/to/output"
#   burnInPercentage="50">
#   ... (your standard MCMC spec goes here)
# </run>

# After running, check output for:
# "marginal likelihood (path sampling) = -XXXXX.XX"
# Compare between models: 2*ln(BF) > 10 = strong preference
Thumb Rule — Bayes Factor Interpretation:
2·ln(BF) < 2 = not worth mentioning
2·ln(BF) 2–6 = positive evidence
2·ln(BF) 6–10 = strong evidence
2·ln(BF) > 10 = very strong / decisive evidence

AI Prompt Guide

Advanced Phylogenomic Analysis
I have a phylogenomic dataset with [NUMBER] loci ([DNA/protein]), [NUMBER] taxa, spanning [DIVERGENCE TIME, e.g. >500 Mya]. My specific concerns are: - [Compositional bias / Long branch attraction / Heterotachy / Multimodal posteriors] - Data type: [Amino acid / Nucleotide / Codon] - Computing: HPC with [CORES] CPUs, [RAM] GB, max wall time [HOURS] Please write a complete pipeline that: 1. Tests model adequacy with posterior predictive simulations 2. Runs IQ-TREE with C60/PMSF site-heterogeneous models 3. Runs PhyloBayes CAT-GTR as a cross-check 4. Compares models using stepping-stone marginal likelihoods in MrBayes 5. Assesses MCMC convergence with RWTY and CODA in R 6. Includes SLURM scripts for each step 7. Tells me when GTR+G is sufficient vs when I need CAT

Troubleshooting

PhyloBayes: Chain runs for weeks with no convergence

CAT-GTR is computationally brutal. For >500 taxa or >50,000 sites, consider: (1) Reducing alignment by removing constant sites, (2) Using PMSF in IQ-TREE as a fast approximation, (3) Using PhyloBayes-MPI with 64+ cores, (4) Running on a subset of the most informative loci.

IQ-TREE: C60 model takes too much memory

C60 with many taxa needs substantial RAM. Try: (1) C20 instead of C60 (much less RAM), (2) The two-step PMSF approach (guide tree first, then C60 on the fixed tree), (3) --safe flag for more conservative memory usage.

MrBayes: ASDSF stuck at 0.05–0.10 for millions of generations

The chains are stuck in different topology islands. Solutions: (1) Increase nchains to 8+ and temp to 0.1–0.2 for more aggressive heating, (2) Run 4+ independent analyses and check for consistent topologies, (3) Consider whether the data genuinely supports multiple topologies (use topology testing — Section 1.7).

Different models give different topologies

This is actually informative, not a bug. If GTR+G and CAT-GTR disagree, the CAT-GTR topology is usually more reliable for deep divergences. The disagreement itself suggests model misspecification under GTR+G. Report both results and discuss model sensitivity in your paper.

Mnemonics & Thumb Rules

Mnemonic

"Deep Divergence Demands Diversity" — 4D Rule

Deep divergences demand site-heterogeneous models (diverse profiles). If your divergences are >500 Mya, GTR+G is not enough. Use C60/PMSF/CAT.

Mnemonic

"Two chains, Two hundred ESS, point-One ASDSF"

The three numbers for MCMC convergence: ≥2 independent runs, ESS ≥200 for all parameters, ASDSF <0.01 for topology. Miss any one = don't publish.

Thumb Rule

Model complexity ladder

Start simple (GTR+G), check adequacy (posterior predictive), escalate if violated (C60 → CAT-GTR). Don't default to the most complex model — over-parameterization wastes time and can reduce accuracy on small datasets.

Thumb Rule — Burn-in: Default 25% burn-in is often insufficient for complex models. Check trace plots — if the chain hasn't stabilized by 25%, increase to 50%. For PhyloBayes CAT, 50% burn-in is standard.