Phylogenetic Inference Deep Dive — Module 1: Substitution Models, ML at Scale, Bayesian MCMC

1.1 — Substitution Models & Model Adequacy

Why Standard GTR+G Can Mislead in Deep Phylogenomics

GTR+Γ assumes a single substitution matrix governs all sites. In real proteins, hydrophobic core sites, surface residues, and active sites experience wildly different selective constraints — they don't share one matrix. For deep divergences (>500 Mya), this model violation causes systematic error (long branch attraction, compositional artifacts) that more data cannot overcome.

Hierarchy of Substitution Models (Increasing Realism)

JC/K2P/HKY — toy models, useful only for teaching
GTR+Γ+I — standard empirical model, adequate for moderate divergences
Empirical mixture (C10–C60) — IQ-TREE's pre-computed profiles
PMSF (Posterior Mean Site Frequencies) — fast site-heterogeneous approximation
CAT / CAT-GTR (PhyloBayes) — infinite mixture, gold standard for deep phylogenomics
Covarion / heterotachy — rates at sites change across the tree

Site-Heterogeneous Models in IQ-TREE

IQ-TREE implements the C10–C60 profile mixture models and the empirical PMSF approach, which approximates a full CAT analysis at a fraction of the computational cost.

Bash

# Step 1: Run a guide tree with a standard model
iqtree2 -s alignment.phy -m LG+G -T AUTO --prefix guide

# Step 2: Run C60 (60-profile mixture) with the guide tree
iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \
  -T AUTO -bb 1000 --prefix c60_tree

# Step 3: PMSF — Posterior Mean Site Frequencies (fastest site-het)
# Uses the guide tree to estimate per-site amino acid frequencies
iqtree2 -s alignment.phy -m LG+C60+F+G -ft guide.treefile \
  -T AUTO -bb 1000 --prefix pmsf_tree

# Compare BIC scores between LG+G and LG+C60+F+G
grep "BIC" guide.iqtree pmsf_tree.iqtree

Download:

Model	Profiles	Speed	When to Use
`LG+C10+F+G`	10	Fast	Quick exploratory runs
`LG+C20+F+G`	20	Moderate	Good balance of speed/accuracy
`LG+C60+F+G`	60	Slow	Publication-quality deep phylogenomics
`LG+C60+F+G -ft`	60 (PMSF)	Fast*	PMSF approximation — best tradeoff

CAT / CAT-GTR in PhyloBayes

PhyloBayes implements the infinite mixture CAT model — each site is drawn from an infinite number of possible amino acid frequency profiles. This is the most realistic model for deep-time phylogenomics but is computationally very expensive.

Bash

# Run two independent chains (ALWAYS run at least 2)
pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain1 &
pb_mpi -d alignment.phy -cat -gtr -dgam 4 chain2 &

# Monitor convergence (run periodically)
# tracecomp compares parameter traces between chains
tracecomp -x 5000 chain1 chain2
# Look for: maxdiff < 0.1 (ideal < 0.3)
# effsize > 100 for all parameters

# bpcomp compares bipartition (topology) frequencies
bpcomp -x 5000 chain1 chain2
# Look for: maxdiff < 0.1

# Generate consensus tree (after convergence)
# Burn-in first 5000 generations, sample every 10th
bpcomp -x 5000 10 chain1 chain2

Download:

Thumb Rule — When to Use CAT-GTR: If your alignment spans >500 Mya divergences (e.g., animal phyla, plant deep nodes), standard GTR+G is inadequate. Use at minimum C60/PMSF in IQ-TREE, or CAT-GTR in PhyloBayes. For shallower divergences (<100 Mya), GTR+G is usually fine.

Posterior Predictive Simulation for Model Adequacy

How do you know your model fits the data? Posterior predictive simulation generates fake datasets under your model and compares statistics to the real data. Large discrepancies = model violation.

Bash

# PhyloBayes: posterior predictive tests
# After chain has converged:
ppred -comp -x 5000 10 chain1
# Tests compositional homogeneity
# z-score > 2 → model is violated for composition

ppred -div -x 5000 10 chain1
# Tests site-specific biochemical diversity
# z-score > 2 → model underestimates site heterogeneity

ppred -stat -x 5000 10 chain1
# Multiple summary statistics simultaneously

Heterotachy & Covarion Models

Heterotachy = the rate at a site changes across lineages over time (a site that's fast in clade A becomes slow in clade B). Standard models assume constant rates per site.

Bash

# IQ-TREE: Heterotachy-aware models
# GHOST model (General Heterogeneous evolution On a Single Topology)
iqtree2 -s alignment.phy -m "LG+FO*H4" -T AUTO -bb 1000

# H2, H3, H4 = number of rate classes that shift across the tree
# More classes = more parameters = slower

# Covarion model (sites switch between "on" and "off")
# In MrBayes:
# lset nst=6 rates=invgamma covarion=yes;

Mnemonic

"Sites Aren't Static" — SAS

In deep phylogenomics, sites change their evolutionary rate across the tree (heterotachy). A conserved site in vertebrates can be variable in insects. Standard models miss this entirely. Use GHOST (IQ-TREE) or covarion (MrBayes) when you suspect heterotachy.

1.2 — Maximum Likelihood at Scale

Partitioned Analysis: Edge-Linked vs Edge-Unlinked

Phylogenomic datasets often contain hundreds of loci with different evolutionary rates. Partitioned models assign each locus (or codon position) its own substitution parameters while sharing the topology.

Partition Model	What's Shared	What's Separate	IQ-TREE Flag
Edge-linked proportional	Tree topology + relative branch lengths	Rate multiplier per partition	`-p` (default)
Edge-unlinked	Tree topology only	Branch lengths per partition	`-Q`
Edge-linked equal	Topology + branch lengths	Substitution model only	`-q`

Bash — IQ-TREE Partitioned

# Create a partition file (NEXUS format)
cat << 'EOF' > partitions.nex
#nexus
begin sets;
  charset gene1 = alignment.phy: 1-450;
  charset gene2 = alignment.phy: 451-900;
  charset gene3 = alignment.phy: 901-1350;
  charset gene4_codon1 = alignment.phy: 1351-1800\3;
  charset gene4_codon2 = alignment.phy: 1352-1800\3;
  charset gene4_codon3 = alignment.phy: 1353-1800\3;
  charpartition mine = HKY+G:gene1, GTR+G:gene2, TVM+G:gene3,
                        GTR+G:gene4_codon1, HKY+G:gene4_codon2,
                        GTR+G:gene4_codon3;
end;
EOF

# Edge-linked proportional (standard — recommended for most cases)
iqtree2 -s alignment.phy -p partitions.nex -m MFP+MERGE \
  -T AUTO -bb 1000 --prefix part_linked

# Edge-unlinked (each partition gets its own branch lengths)
# Use for highly heterogeneous rate datasets
iqtree2 -s alignment.phy -Q partitions.nex -m MFP+MERGE \
  -T AUTO -bb 1000 --prefix part_unlinked

# MFP+MERGE: ModelFinder + partition merging (PartitionFinder-like)
# Merges partitions with similar rates to reduce over-parameterization

# For VERY large datasets (>1000 loci):
iqtree2 -S partition_dir/ -m MFP --prefix loci \
  -T AUTO -bb 1000
# -S: each file in directory = one partition

Download:

RAxML-NG for Massive Datasets

Bash

# Partitioned analysis with RAxML-NG
raxml-ng --all --msa concat.phy --model partitions.txt \
  --tree pars{10},rand{10} --bs-trees 200 \
  --threads auto{64} --prefix raxml_part

# partitions.txt format:
# GTR+G, gene1 = 1-450
# GTR+G, gene2 = 451-900
# LG+G+F, protein1 = 1-300

# Checkpoint & recovery (crucial for HPC wall-time limits)
# If job gets killed, just re-run the same command:
raxml-ng --all --msa concat.phy --model partitions.txt \
  --tree pars{10} --bs-trees 200 --threads auto{64} \
  --prefix raxml_part --redo

# Check: is your dataset too large for memory?
raxml-ng --parse --msa concat.phy --model partitions.txt
# Prints estimated memory requirement

Download:

Constrained Tree Searches

Test alternative topologies or enforce known relationships.

Bash

# IQ-TREE: backbone constraint
# Force a monophyletic clade while letting other taxa move freely
echo "((TaxonA,TaxonB,TaxonC),(TaxonD,TaxonE));" > constraint.tre
iqtree2 -s alignment.phy -m MFP -g constraint.tre -bb 1000

# RAxML-NG: constrained search
raxml-ng --all --msa alignment.phy --model GTR+G \
  --tree-constraint constraint.tre --bs-trees 200

HPC Strategies for Large Phylogenomics

Bash — SLURM Script

#!/bin/bash
#SBATCH --job-name=iqtree_phylogenomics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem=128G
#SBATCH --time=7-00:00:00
#SBATCH --output=iqtree_%j.log

module load iqtree/2.2.6

# Safe mode: if job is killed, IQ-TREE resumes from checkpoint
iqtree2 -s phylogenomic_alignment.phy \
  -p partitions.nex \
  -m MFP+MERGE \
  -bb 1000 -alrt 1000 \
  -T ${SLURM_CPUS_PER_TASK} \
  --prefix phylo_result \
  -redo  # remove -redo for checkpoint resume

Download:

Thumb Rule — Concatenation Pitfall: Concatenating 500 loci and running a single GTR+G model is wrong. Always use partitioned analysis (-p or -Q). Even better: run individual gene trees and use ASTRAL for the species tree (Module 1.5).

Mnemonic

"P for Proportional, Q for Quite different"

In IQ-TREE: -p = edge-linked proportional (branch lengths scale by partition rate). -Q = edge-unlinked (each partition has completely independent branch lengths). Use -Q when partitions have very different rates (e.g., coding vs. noncoding).

1.3 — Bayesian Inference & MCMC Diagnostics

Marginal Likelihood Estimation for Model Comparison

In Bayesian phylogenetics, Bayes factors compare models. They require the marginal likelihood — the probability of the data integrated over all parameters and trees. The two gold-standard methods:

Method	Accuracy	Speed	Tool
Stepping-stone sampling	Excellent	Slow (doubles runtime)	MrBayes, BEAST2
Path sampling	Excellent	Slow	BEAST2
Harmonic mean	Poor (biased!)	Fast	MrBayes (legacy)
AICM	Approximate	Fast	Tracer

NEXUS — MrBayes Stepping-Stone

begin mrbayes;
  set autoclose=yes nowarn=yes;
  
  [Model specification]
  lset nst=6 rates=invgamma;
  prset brlenspr=unconstrained:exp(10.0);
  prset shapepr=exponential(1.0);
  
  [Standard MCMC first — to get a starting tree]
  mcmc ngen=5000000 samplefreq=1000 nchains=4 nruns=2;
  sump;
  sumt;
  
  [Stepping-stone sampling for marginal likelihood]
  ss ngen=1000000 diagnfreq=5000 nsteps=50 alpha=0.4;
  
  [Compare models using 2*ln(BF)]
  [BF > 10 = strong evidence for better model]
  [BF > 100 = decisive]
end;

Download:

Never Use Harmonic Mean for Model Comparison

The harmonic mean estimator is severely biased and unreliable. It was the default in older MrBayes versions. Always use stepping-stone or path sampling. If a paper reports harmonic mean Bayes factors, treat the results with skepticism.

Reversible Jump MCMC for Model Averaging

Instead of picking one model, reversible-jump (rjMCMC) explores the model space during the MCMC run, effectively averaging over all models proportional to their posterior probability.

NEXUS — MrBayes Model Averaging

# In MrBayes: enable reversible-jump across all GTR submodels
lset nst=mixed rates=invgamma;
# This uses rjMCMC to sample across JC, K2P, HKY, SYM, GTR etc.
# The posterior probability of each model is reported in sump output

mcmc ngen=10000000 samplefreq=1000 nchains=4 nruns=2;
sump burnin=2500;
# Check: "Model indicator" in sump output shows model posterior probs

MCMC Convergence Diagnostics — Beyond Tracer

Tracer is a good first look, but it misses multimodal posteriors and chain mixing issues. Use additional diagnostics:

R — RWTY Convergence Diagnostics

library(rwty)

# Load MrBayes output (both runs)
my_chains <- load.multi("mrbayes_output", format="mb")

# 1. Trace plots — look for stationarity and mixing
makeplot.trace(my_chains, burnin=25)

# 2. ESS (Effective Sample Size) — want > 200 for all parameters
makeplot.ess(my_chains, burnin=25)

# 3. Topology trace — are chains exploring the same tree space?
makeplot.topology(my_chains, burnin=25)

# 4. ASDSF (Average Standard Deviation of Split Frequencies)
# The gold standard for topology convergence
# Want < 0.01 (MrBayes reports this automatically)

# 5. Compare posterior distributions between chains
makeplot.pairs(my_chains, burnin=25)

# 6. Autocorrelation — chains should mix well
makeplot.autocorr(my_chains, burnin=25)

Download:

R — CODA Package Diagnostics

library(coda)

# Read parameter log as MCMC object
params <- read.table("chain1.p", header=TRUE, skip=1)
mc <- mcmc(params[-(1:2500),])  # remove burn-in

# Gelman-Rubin diagnostic (need 2+ chains)
params2 <- read.table("chain2.p", header=TRUE, skip=1)
mc2 <- mcmc(params2[-(1:2500),])
mc_list <- mcmc.list(mc, mc2)

gelman.diag(mc_list)
# PSRF (Potential Scale Reduction Factor) should be ~1.0
# Upper 95% CI should be < 1.05

# Geweke diagnostic (single chain stationarity)
geweke.diag(mc)
# |z| < 2 for converged parameters

# Raftery-Lewis diagnostic (how long to run)
raftery.diag(mc, q=0.025, r=0.005, s=0.95)
# Tells you minimum chain length needed

Download:

Dealing with Multimodal Posteriors

Multimodal posteriors are the hidden danger of Bayesian phylogenetics. Your chains may converge to different tree topologies and both look well-mixed. Signs: (1) Two runs give different consensus trees. (2) ASDSF stays high (>0.05) even after millions of generations. (3) Topology trace shows chains jumping between islands. Solutions: Run many (4+) independent chains, use Metropolis-coupled MCMC (MC3) with more heated chains (nchains=8), increase temperature spacing (temp=0.2), or run much longer.

MCMC Convergence Checklist

☑ ESS > 200 for all parameters (Tracer)
☑ PSRF ~ 1.0 for all parameters (Gelman-Rubin / CODA)
☑ ASDSF < 0.01 for topology (MrBayes / RWTY)
☑ Traces show stationarity (no trend) after burn-in
☑ Two independent runs give same topology and parameter estimates
☑ Posterior predictive checks pass (PhyloBayes)

BEAST2 Marginal Likelihood (Path Sampling)

XML snippet — BEAST2

# In BEAUti: set up your analysis normally, then
# Analysis → Use Path Sampling/Stepping Stone
# Or add to XML:
# <run spec="beast.inference.PathSampler"
#   nrOfSteps="100" chainLength="1000000"
#   alpha="0.3" rootdir="/path/to/output"
#   burnInPercentage="50">
#   ... (your standard MCMC spec goes here)
# </run>

# After running, check output for:
# "marginal likelihood (path sampling) = -XXXXX.XX"
# Compare between models: 2*ln(BF) > 10 = strong preference

Thumb Rule — Bayes Factor Interpretation:
2·ln(BF) < 2 = not worth mentioning
2·ln(BF) 2–6 = positive evidence
2·ln(BF) 6–10 = strong evidence
2·ln(BF) > 10 = very strong / decisive evidence

AI Prompt Guide

Advanced Phylogenomic Analysis

I have a phylogenomic dataset with [NUMBER] loci ([DNA/protein]), [NUMBER] taxa, spanning [DIVERGENCE TIME, e.g. >500 Mya]. My specific concerns are: - [Compositional bias / Long branch attraction / Heterotachy / Multimodal posteriors] - Data type: [Amino acid / Nucleotide / Codon] - Computing: HPC with [CORES] CPUs, [RAM] GB, max wall time [HOURS] Please write a complete pipeline that: 1. Tests model adequacy with posterior predictive simulations 2. Runs IQ-TREE with C60/PMSF site-heterogeneous models 3. Runs PhyloBayes CAT-GTR as a cross-check 4. Compares models using stepping-stone marginal likelihoods in MrBayes 5. Assesses MCMC convergence with RWTY and CODA in R 6. Includes SLURM scripts for each step 7. Tells me when GTR+G is sufficient vs when I need CAT

⚠ Troubleshooting

PhyloBayes: Chain runs for weeks with no convergence

CAT-GTR is computationally brutal. For >500 taxa or >50,000 sites, consider: (1) Reducing alignment by removing constant sites, (2) Using PMSF in IQ-TREE as a fast approximation, (3) Using PhyloBayes-MPI with 64+ cores, (4) Running on a subset of the most informative loci.

IQ-TREE: C60 model takes too much memory

C60 with many taxa needs substantial RAM. Try: (1) C20 instead of C60 (much less RAM), (2) The two-step PMSF approach (guide tree first, then C60 on the fixed tree), (3) --safe flag for more conservative memory usage.

MrBayes: ASDSF stuck at 0.05–0.10 for millions of generations

The chains are stuck in different topology islands. Solutions: (1) Increase nchains to 8+ and temp to 0.1–0.2 for more aggressive heating, (2) Run 4+ independent analyses and check for consistent topologies, (3) Consider whether the data genuinely supports multiple topologies (use topology testing — Section 1.7).

Different models give different topologies

This is actually informative, not a bug. If GTR+G and CAT-GTR disagree, the CAT-GTR topology is usually more reliable for deep divergences. The disagreement itself suggests model misspecification under GTR+G. Report both results and discuss model sensitivity in your paper.

Mnemonics & Thumb Rules

Mnemonic

"Deep Divergence Demands Diversity" — 4D Rule

Deep divergences demand site-heterogeneous models (diverse profiles). If your divergences are >500 Mya, GTR+G is not enough. Use C60/PMSF/CAT.

Mnemonic

"Two chains, Two hundred ESS, point-One ASDSF"

The three numbers for MCMC convergence: ≥2 independent runs, ESS ≥200 for all parameters, ASDSF <0.01 for topology. Miss any one = don't publish.

Thumb Rule

Model complexity ladder

Start simple (GTR+G), check adequacy (posterior predictive), escalate if violated (C60 → CAT-GTR). Don't default to the most complex model — over-parameterization wastes time and can reduce accuracy on small datasets.

Thumb Rule — Burn-in: Default 25% burn-in is often insufficient for complex models. Check trace plots — if the chain hasn't stabilized by 25%, increase to 50%. For PhyloBayes CAT, 50% burn-in is standard.

Phylogenetics Intro Module 2: Gene/Species Trees

Phylogenetic Inference — Deep Dive Module 1

On this page

1.1 — Substitution Models & Model Adequacy

Why Standard GTR+G Can Mislead in Deep Phylogenomics

Hierarchy of Substitution Models (Increasing Realism)

Site-Heterogeneous Models in IQ-TREE

CAT / CAT-GTR in PhyloBayes

Posterior Predictive Simulation for Model Adequacy

Heterotachy & Covarion Models

1.2 — Maximum Likelihood at Scale

Partitioned Analysis: Edge-Linked vs Edge-Unlinked

RAxML-NG for Massive Datasets

Constrained Tree Searches

HPC Strategies for Large Phylogenomics

1.3 — Bayesian Inference & MCMC Diagnostics

Marginal Likelihood Estimation for Model Comparison

Never Use Harmonic Mean for Model Comparison

Reversible Jump MCMC for Model Averaging

MCMC Convergence Diagnostics — Beyond Tracer

Dealing with Multimodal Posteriors

MCMC Convergence Checklist

BEAST2 Marginal Likelihood (Path Sampling)

AI Prompt Guide

Advanced Phylogenomic Analysis

⚠ Troubleshooting

PhyloBayes: Chain runs for weeks with no convergence

IQ-TREE: C60 model takes too much memory

MrBayes: ASDSF stuck at 0.05–0.10 for millions of generations

Different models give different topologies

Mnemonics & Thumb Rules