Discrete & Continuous Character Evolution · Biogeography · Ancestral Sequences · dN/dS · Branch-Site Models · Convergent Evolution
Reconstructing the evolutionary history of discrete traits (e.g., habitat, diet, flower color, genome feature presence/absence) on a phylogeny.
library(phytools)
# Load tree and trait data
tree <- read.tree("dated_tree.nwk")
trait <- setNames(c("terrestrial","aquatic","terrestrial","aquatic","terrestrial"),
tree$tip.label)
# Fit transition models
fit_ER <- fitMk(tree, trait, model="ER") # equal rates
fit_ARD <- fitMk(tree, trait, model="ARD") # all rates different
fit_SYM <- fitMk(tree, trait, model="SYM") # symmetric
# Compare with AIC
aic_scores <- c(ER=AIC(fit_ER), ARD=AIC(fit_ARD), SYM=AIC(fit_SYM))
print(aic_scores) # lowest = best
# Stochastic character mapping (SIMMAP)
# Simulates character histories on the tree, accounting for uncertainty
simmaps <- make.simmap(tree, trait, model="ARD", nsim=500)
summary_map <- summary(simmaps)
# Plot with posterior probabilities at nodes
plot(summary_map, fsize=0.7, ftype="i")
add.simmap.legend(colors=setNames(c("brown","blue"),
c("terrestrial","aquatic")), prompt=FALSE)
# Correlated evolution of two binary traits (Pagel 1994)
trait1 <- setNames(c(1,0,1,0,1), tree$tip.label)
trait2 <- setNames(c(1,1,0,0,1), tree$tip.label)
fit_dep <- fitPagel(tree, trait1, trait2, dep.var="xy")
fit_indep <- fitPagel(tree, trait1, trait2, dep.var="x")
# LRT: if dependent model significantly better → correlated evolution
library(corHMM) # Hidden rate models: the transition rate between states # can itself vary across the tree (hidden Markov model) data_df <- data.frame(Taxon=tree$tip.label, Trait=trait) # Standard Mk (1 rate class) fit1 <- corHMM(tree, data_df, rate.cat=1, model="ARD") # Hidden Markov model (2 rate classes) fit2 <- corHMM(tree, data_df, rate.cat=2, model="ARD") # Compare with AIC print(c(fit1$AICc, fit2$AICc)) # If 2-rate class is better, transition rates vary across the tree # (some clades switch states faster than others)
"ER = Equal, SYM = Same both ways, ARD = All Different"
ER: one rate for all transitions. SYM: forward = reverse but different between state pairs. ARD: every transition has its own rate. Start with ER, escalate to ARD only if AIC improves.
| Model | Process | Prediction | Tool |
|---|---|---|---|
| Brownian Motion (BM) | Random walk | Variance increases with time | geiger, phytools |
| Ornstein-Uhlenbeck (OU) | Random walk + pull toward optimum | Trait constrained around adaptive peak | OUwie, bayou, l1ou |
| Early Burst (EB) | Rate decreases exponentially over time | Rapid early divergence then stasis | geiger |
| Multi-regime OU | Different clades pulled to different optima | Adaptive shifts at specific branches | OUwie, surface, bayou |
library(geiger)
library(phytools)
library(OUwie)
tree <- read.tree("dated_tree.nwk")
trait <- setNames(rnorm(length(tree$tip.label)), tree$tip.label)
# Fit models
fit_BM <- fitContinuous(tree, trait, model="BM")
fit_OU <- fitContinuous(tree, trait, model="OU")
fit_EB <- fitContinuous(tree, trait, model="EB")
# Compare with AIC
aics <- c(BM=fit_BM$opt$aicc, OU=fit_OU$opt$aicc, EB=fit_EB$opt$aicc)
print(aics) # lowest wins
# Phylogenetic signal
lambda <- phylosig(tree, trait, method="lambda", test=TRUE)
K <- phylosig(tree, trait, method="K", test=TRUE)
# Lambda: 0 = no signal, 1 = BM expectation
# Blomberg's K: <1 = less signal than BM, >1 = more signal than BM
# PGLS — Phylogenetic Generalized Least Squares
library(caper)
dat <- data.frame(species=tree$tip.label, body_size=trait,
brain_size=rnorm(length(tree$tip.label)))
comp_dat <- comparative.data(tree, dat, names.col="species")
pgls_fit <- pgls(brain_size ~ body_size, data=comp_dat, lambda="ML")
summary(pgls_fit)
# Reports: slope, intercept, estimated lambda, p-value
# Lambda=1 → full phylogenetic correction needed
# Lambda=0 → OLS equivalent (no phylogenetic signal in residuals)
# Multi-regime OU with OUwie
# Requires: tree with mapped regimes (use SIMMAP or painted tree)
# OUwie(tree, data, model="OUMV") # different optima + different sigma²
library(BioGeoBEARS)
library(cladoRcpp)
# Input: dated tree + geographic ranges (presence/absence in areas)
trfn <- "dated_tree.nwk"
geogfn <- "geographic_ranges.txt"
# geographic_ranges.txt format:
# 6 3 (6 taxa, 3 areas)
# TaxonA 110 (present in areas A and B)
# TaxonB 010 (present in area B only)
# TaxonC 001 (present in area C only)
# Set up DEC model
BGB <- define_BioGeoBEARS_run()
BGB$trfn <- trfn
BGB$geogfn <- geogfn
BGB$max_range_size <- 3
BGB$num_cores_to_use <- 4
# Run DEC
BGB$BioGeoBEARS_model_object@params_table["d","est"] <- 0.01
BGB$BioGeoBEARS_model_object@params_table["e","est"] <- 0.01
res_DEC <- bears_optim_run(BGB)
# Run DEC+J (founder-event speciation)
BGB_J <- BGB
BGB_J$BioGeoBEARS_model_object@params_table["j","type"] <- "free"
BGB_J$BioGeoBEARS_model_object@params_table["j","est"] <- 0.01
res_DECJ <- bears_optim_run(BGB_J)
# Run DIVALIKE and BAYAREALIKE (+ J variants)
# ... similar setup with different model specifications
# Compare all 6 models with AICc
AICc_table <- data.frame(
Model = c("DEC","DEC+J","DIVALIKE","DIVALIKE+J","BAYAREALIKE","BAYAREALIKE+J"),
LnL = c(res_DEC$LnL, res_DECJ$LnL), # fill in all 6
AICc = c(res_DEC$AICc, res_DECJ$AICc) # fill in all 6
)
print(AICc_table)
# Plot ancestral ranges on tree
plot_BioGeoBEARS_results(res_DECJ, plotwhat="pie")
The "+J" (jump dispersal / founder-event speciation) parameter in BioGeoBEARS is statistically controversial. Ree & Sanmartín (2018) argued it creates a model comparison framework that unfairly favors +J models. Use with caution: always report both +J and non-J results, and interpret the biological plausibility of founder events in your system.
# PAML: Ancestral sequence reconstruction # codeml.ctl for marginal reconstruction: seqfile = codon_alignment.phy treefile = tree.nwk outfile = results_asr noisy = 3 verbose = 1 runmode = 0 seqtype = 1 * 1=codons CodonFreq = 2 * F3x4 model = 0 * M0 (one omega) NSsites = 0 RateAncestor = 1 * KEY: enables ancestral reconstruction codeml codeml.ctl # Output: rst file with reconstructed ancestral sequences at each node # Reports: marginal probabilities for each site at each node # IQ-TREE: ancestral sequence reconstruction iqtree2 -s alignment.phy -m MFP -asr -T AUTO # Output: .state file with marginal probabilities per site per node # Joint vs Marginal reconstruction: # Marginal: independently reconstructs each node (most probable state) # Joint: finds globally most probable set of ancestral states # Marginal is standard; joint can give different results at ambiguous nodes
| ω Value | Interpretation | Biological Meaning |
|---|---|---|
| ω << 1 | Strong purifying selection | Protein function is highly conserved |
| ω ≈ 1 | Neutral evolution | No selective constraint (pseudogenes, junk) |
| ω > 1 | Positive selection | Amino acid changes are favored (adaptation) |
| ω = 0.1–0.3 | Typical for most genes | Strong constraint with some flexibility |
# PAML yn00: pairwise dN/dS estimation # yn00.ctl: seqfile = codon_alignment.phy outfile = yn00_results verbose = 0 icode = 0 * 0=universal genetic code weession = 0 yn00 yn00.ctl # Output: matrix of pairwise dN, dS, and omega values # KaKs_Calculator (alternative, faster) KaKs_Calculator -i pairs.axt -o results.kaks -m YN # Sliding window dN/dS analysis # Use: DnaSP, or custom R script with seqinr # Reveals which regions of a gene are under different selection pressures
# Label foreground branches in your tree file with #1 # tree: ((A #1, B), (C, D)); # #1 marks the branch leading to A as foreground # codeml.ctl for two-ratio model: seqfile = codon_alignment.phy treefile = labeled_tree.nwk outfile = branch_results noisy = 3 runmode = 0 seqtype = 1 CodonFreq = 2 model = 2 * 2 = two-ratio (foreground vs background) NSsites = 0 fix_omega = 0 omega = 1.5 * initial value codeml codeml.ctl # Also run null model (model = 0, one ratio for all branches) # LRT: 2*(lnL_alt - lnL_null) ~ chi²(df=1) # Significant → foreground lineage has different omega # Free-ratio model (each branch gets its own omega) # model = 1 # Warning: over-parameterized for large trees, use for exploration only # Clade models (CmC, CmD): allow sites to switch selective regimes # between clades. More powerful than branch models for detecting # clade-specific selection shifts at specific sites.
# codeml.ctl for site models: seqfile = codon_alignment.phy treefile = tree.nwk outfile = site_results noisy = 3 runmode = 0 seqtype = 1 CodonFreq = 2 model = 0 * 0 = same tree for all NSsites = 0 1 2 7 8 * run M0, M1a, M2a, M7, M8 fix_omega = 0 omega = 1.5 codeml codeml.ctl # Key comparisons (LRT): # M1a (nearly neutral) vs M2a (selection): tests for ω > 1 sites # df = 2, critical chi² = 5.99 (p=0.05) # M7 (beta) vs M8 (beta + ω>1): more sensitive test # df = 2, critical chi² = 5.99 # M8a (beta + ω=1) vs M8: stricter null # df = 1, critical chi² = 3.84 # If M2a or M8 is significant: # Check BEB (Bayes Empirical Bayes) output for positively selected sites # Sites with P(ω>1) > 0.95 are candidates
# BUSTED: Branch-site Unrestricted Statistical Test for Episodic Diversification # Tests if gene has experienced positive selection on ANY branch hyphy busted --alignment codon_aln.fasta --tree tree.nwk # FEL: Fixed Effects Likelihood (pervasive selection per site) hyphy fel --alignment codon_aln.fasta --tree tree.nwk # MEME: Mixed Effects Model of Evolution (episodic selection per site) # More powerful than FEL — detects selection at specific sites on specific branches hyphy meme --alignment codon_aln.fasta --tree tree.nwk # FUBAR: Fast Unconstrained Bayesian AppRoximation (Bayesian site-level) hyphy fubar --alignment codon_aln.fasta --tree tree.nwk # aBSREL: adaptive Branch-Site REL (selection on each branch) hyphy absrel --alignment codon_aln.fasta --tree tree.nwk # RELAX: test for relaxation or intensification of selection hyphy relax --alignment codon_aln.fasta --tree labeled_tree.nwk \ --test Foreground # Results in JSON format — use HyPhy Vision (vision.hyphy.org) to visualize
| Method | Detects | Scope | Statistical Power |
|---|---|---|---|
| PAML M2a/M8 | Pervasive site-level positive selection | Across entire tree | Moderate (conservative) |
| MEME | Episodic site-level selection | Specific branches + sites | High (most sensitive) |
| BUSTED | Gene-wide positive selection | Gene on any branch | High |
| aBSREL | Branch-specific positive selection | Each branch tested | Moderate |
| FUBAR | Pervasive site-level (Bayesian) | Across entire tree | High for larger datasets |
| FEL | Pervasive site-level (frequentist) | Across entire tree | Lower for small datasets |
"MEME for Moments, FEL for Forever"
MEME detects episodic selection (moments in time — specific branches). FEL detects pervasive selection (forever — across the whole tree). Use both: MEME finds what FEL misses.
# The most popular test for episodic positive selection # Tests: did specific sites experience ω>1 on a specific lineage? # Label foreground branch in tree: ((A #1, B), (C, D)); # Alternative model (model A): # codeml.ctl: model = 2 NSsites = 2 fix_omega = 0 omega = 1.5 # Null model (fix ω₂ = 1): model = 2 NSsites = 2 fix_omega = 1 omega = 1 * fixed at 1 # LRT: 2*(lnL_alt - lnL_null) ~ chi²(df=1), or mix(χ²₀, χ²₁) # Critical value: 2.71 (5% with one-sided test on boundary) # If significant: check BEB for sites with P(class 2a) > 0.95 # These sites experienced ω>1 specifically on the foreground branch # Common pitfall: testing too many foreground branches # Correct for multiple testing (Bonferroni or FDR)
# CodonW: classic tool for codon usage analysis codonw sequences.fasta -all -noblk -nomenu # Output: ENC (effective number of codons), GC3, CAI, RSCU # ENC (Nc): ranges from 20 (max bias) to 61 (no bias) # CAI: Codon Adaptation Index (0-1, higher = more adapted to tRNA pool) # RSCU: Relative Synonymous Codon Usage (>1 = preferred, <1 = avoided) # DAMBE: GUI tool for comprehensive codon usage analysis # Includes: ENC plot (ENC vs GC3), correspondence analysis, Ina's test # Distinguishing mutational bias from translational selection: # Plot ENC vs GC3s (Wright 1990) # Points below the expected curve → translational selection # Points on the curve → purely mutational bias
library(RERconverge)
# RERconverge: detect genes with convergent rate shifts
# associated with a phenotype
# Input: gene trees (one per gene) + species tree + binary trait
trees <- readTrees("gene_trees_dir/")
trait <- setNames(c(1,1,0,0,0,1,0), tree$tip.label)
# 1 = convergent phenotype (e.g., echolocation in bats & dolphins)
# Compute relative evolutionary rates
RERmat <- getAllResiduals(trees, transform="sqrt", weighted=TRUE)
# Correlate rates with phenotype
results <- correlateWithBinaryPhenotype(RERmat, trait, min.sp=10)
# Top genes: convergently accelerated in phenotype-bearing lineages
head(results[order(results$p.adj),], 20)
# PCOC: Profile Changes in Orthologous Clusters
# Detects convergent amino acid substitutions
# pcoc -t tree.nwk -a alignment.fasta -o results/
The optimization is getting stuck. Solutions: (1) Try different initial omega values (0.5, 1.5, 3.0), (2) Check alignment for stop codons or frameshifts, (3) Ensure alignment is in-frame codons, (4) Remove sequences with many gaps. Use cleandata = 1 to remove ambiguous sites.
HyPhy expects in-frame codon alignment. Check: (1) Alignment length is divisible by 3, (2) No internal stop codons, (3) Sequences start at codon position 1. Use MACSE or PAL2NAL to create proper codon alignments from protein alignments.
Common when areas are too numerous (>7 areas, 2^7 = 128 ranges). Solutions: (1) Reduce number of areas by lumping adjacent regions, (2) Set max_range_size to 2 or 3, (3) Use adjacency constraints to prevent biologically implausible ranges.
Multi-regime OU models are notoriously finicky. Solutions: (1) Ensure your mapped regimes are correct (use SIMMAP), (2) Try simpler models first (BMS before OUMV), (3) Increase opts$iterations, (4) Check that trait values are on the right scale (log-transform if needed).
"ω < 1 = Purifying, ω = 1 = Neutral, ω > 1 = Positive" — PNP
The most fundamental equation in molecular evolution. Most genes have ω ≈ 0.1–0.3 (strong purifying selection). ω > 1 at specific sites = positive selection driving amino acid change.
"M7 vs M8 = the Money test"
The M7 (beta) vs M8 (beta+ω>1) comparison in PAML is the most widely used test for positive selection. If significant, look at the BEB output for sites with posterior probability >0.95.