Module 4 — Ancestral States & Molecular Selection

Discrete & Continuous Character Evolution · Biogeography · Ancestral Sequences · dN/dS · Branch-Site Models · Convergent Evolution

~180 min Advanced R / Bash / Python
Byte

3.1 — Discrete Character Evolution

Reconstructing the evolutionary history of discrete traits (e.g., habitat, diet, flower color, genome feature presence/absence) on a phylogeny.

R — phytools: Stochastic Character Mapping
library(phytools)

# Load tree and trait data
tree <- read.tree("dated_tree.nwk")
trait <- setNames(c("terrestrial","aquatic","terrestrial","aquatic","terrestrial"),
                   tree$tip.label)

# Fit transition models
fit_ER  <- fitMk(tree, trait, model="ER")   # equal rates
fit_ARD <- fitMk(tree, trait, model="ARD")  # all rates different
fit_SYM <- fitMk(tree, trait, model="SYM")  # symmetric

# Compare with AIC
aic_scores <- c(ER=AIC(fit_ER), ARD=AIC(fit_ARD), SYM=AIC(fit_SYM))
print(aic_scores)  # lowest = best

# Stochastic character mapping (SIMMAP)
# Simulates character histories on the tree, accounting for uncertainty
simmaps <- make.simmap(tree, trait, model="ARD", nsim=500)
summary_map <- summary(simmaps)

# Plot with posterior probabilities at nodes
plot(summary_map, fsize=0.7, ftype="i")
add.simmap.legend(colors=setNames(c("brown","blue"),
                  c("terrestrial","aquatic")), prompt=FALSE)

# Correlated evolution of two binary traits (Pagel 1994)
trait1 <- setNames(c(1,0,1,0,1), tree$tip.label)
trait2 <- setNames(c(1,1,0,0,1), tree$tip.label)
fit_dep   <- fitPagel(tree, trait1, trait2, dep.var="xy")
fit_indep <- fitPagel(tree, trait1, trait2, dep.var="x")
# LRT: if dependent model significantly better → correlated evolution
Download:

corHMM — Hidden Rate Models

R
library(corHMM)

# Hidden rate models: the transition rate between states
# can itself vary across the tree (hidden Markov model)
data_df <- data.frame(Taxon=tree$tip.label, Trait=trait)

# Standard Mk (1 rate class)
fit1 <- corHMM(tree, data_df, rate.cat=1, model="ARD")

# Hidden Markov model (2 rate classes)
fit2 <- corHMM(tree, data_df, rate.cat=2, model="ARD")

# Compare with AIC
print(c(fit1$AICc, fit2$AICc))
# If 2-rate class is better, transition rates vary across the tree
# (some clades switch states faster than others)
Download:
Mnemonic

"ER = Equal, SYM = Same both ways, ARD = All Different"

ER: one rate for all transitions. SYM: forward = reverse but different between state pairs. ARD: every transition has its own rate. Start with ER, escalate to ARD only if AIC improves.

3.2 — Continuous Character Evolution

ModelProcessPredictionTool
Brownian Motion (BM)Random walkVariance increases with timegeiger, phytools
Ornstein-Uhlenbeck (OU)Random walk + pull toward optimumTrait constrained around adaptive peakOUwie, bayou, l1ou
Early Burst (EB)Rate decreases exponentially over timeRapid early divergence then stasisgeiger
Multi-regime OUDifferent clades pulled to different optimaAdaptive shifts at specific branchesOUwie, surface, bayou
R — Fitting Evolutionary Models
library(geiger)
library(phytools)
library(OUwie)

tree <- read.tree("dated_tree.nwk")
trait <- setNames(rnorm(length(tree$tip.label)), tree$tip.label)

# Fit models
fit_BM <- fitContinuous(tree, trait, model="BM")
fit_OU <- fitContinuous(tree, trait, model="OU")
fit_EB <- fitContinuous(tree, trait, model="EB")

# Compare with AIC
aics <- c(BM=fit_BM$opt$aicc, OU=fit_OU$opt$aicc, EB=fit_EB$opt$aicc)
print(aics)  # lowest wins

# Phylogenetic signal
lambda <- phylosig(tree, trait, method="lambda", test=TRUE)
K <- phylosig(tree, trait, method="K", test=TRUE)
# Lambda: 0 = no signal, 1 = BM expectation
# Blomberg's K: <1 = less signal than BM, >1 = more signal than BM

# PGLS — Phylogenetic Generalized Least Squares
library(caper)
dat <- data.frame(species=tree$tip.label, body_size=trait,
                   brain_size=rnorm(length(tree$tip.label)))
comp_dat <- comparative.data(tree, dat, names.col="species")
pgls_fit <- pgls(brain_size ~ body_size, data=comp_dat, lambda="ML")
summary(pgls_fit)
# Reports: slope, intercept, estimated lambda, p-value
# Lambda=1 → full phylogenetic correction needed
# Lambda=0 → OLS equivalent (no phylogenetic signal in residuals)

# Multi-regime OU with OUwie
# Requires: tree with mapped regimes (use SIMMAP or painted tree)
# OUwie(tree, data, model="OUMV")  # different optima + different sigma²
Download:
Thumb Rule — OU vs BM: If OU is selected over BM, it means the trait is pulled toward an optimum (stabilizing selection). But beware: OU can also fit measurement error or incorrect phylogeny. Always check if alpha (pull strength) has a wide confidence interval — if it does, OU may be overfitting.

3.3 — Biogeographic Ancestral Range Reconstruction

R — BioGeoBEARS
library(BioGeoBEARS)
library(cladoRcpp)

# Input: dated tree + geographic ranges (presence/absence in areas)
trfn <- "dated_tree.nwk"
geogfn <- "geographic_ranges.txt"
# geographic_ranges.txt format:
# 6  3       (6 taxa, 3 areas)
# TaxonA  110   (present in areas A and B)
# TaxonB  010   (present in area B only)
# TaxonC  001   (present in area C only)

# Set up DEC model
BGB <- define_BioGeoBEARS_run()
BGB$trfn <- trfn
BGB$geogfn <- geogfn
BGB$max_range_size <- 3
BGB$num_cores_to_use <- 4

# Run DEC
BGB$BioGeoBEARS_model_object@params_table["d","est"] <- 0.01
BGB$BioGeoBEARS_model_object@params_table["e","est"] <- 0.01
res_DEC <- bears_optim_run(BGB)

# Run DEC+J (founder-event speciation)
BGB_J <- BGB
BGB_J$BioGeoBEARS_model_object@params_table["j","type"] <- "free"
BGB_J$BioGeoBEARS_model_object@params_table["j","est"] <- 0.01
res_DECJ <- bears_optim_run(BGB_J)

# Run DIVALIKE and BAYAREALIKE (+ J variants)
# ... similar setup with different model specifications

# Compare all 6 models with AICc
AICc_table <- data.frame(
  Model = c("DEC","DEC+J","DIVALIKE","DIVALIKE+J","BAYAREALIKE","BAYAREALIKE+J"),
  LnL = c(res_DEC$LnL, res_DECJ$LnL),  # fill in all 6
  AICc = c(res_DEC$AICc, res_DECJ$AICc)  # fill in all 6
)
print(AICc_table)

# Plot ancestral ranges on tree
plot_BioGeoBEARS_results(res_DECJ, plotwhat="pie")
Download:
The +J Parameter Controversy

The "+J" (jump dispersal / founder-event speciation) parameter in BioGeoBEARS is statistically controversial. Ree & Sanmartín (2018) argued it creates a model comparison framework that unfairly favors +J models. Use with caution: always report both +J and non-J results, and interpret the biological plausibility of founder events in your system.

3.4 — Ancestral Sequence Reconstruction

Bash — PAML & IQ-TREE
# PAML: Ancestral sequence reconstruction
# codeml.ctl for marginal reconstruction:
  seqfile = codon_alignment.phy
  treefile = tree.nwk
  outfile = results_asr
  noisy = 3
  verbose = 1
  runmode = 0
  seqtype = 1     * 1=codons
  CodonFreq = 2   * F3x4
  model = 0       * M0 (one omega)
  NSsites = 0
  RateAncestor = 1  * KEY: enables ancestral reconstruction
  
codeml codeml.ctl
# Output: rst file with reconstructed ancestral sequences at each node
# Reports: marginal probabilities for each site at each node

# IQ-TREE: ancestral sequence reconstruction
iqtree2 -s alignment.phy -m MFP -asr -T AUTO
# Output: .state file with marginal probabilities per site per node

# Joint vs Marginal reconstruction:
# Marginal: independently reconstructs each node (most probable state)
# Joint: finds globally most probable set of ancestral states
# Marginal is standard; joint can give different results at ambiguous nodes
Download:
Byte
Applications of Ancestral Sequence Reconstruction:
  • Ancestral protein resurrection: Synthesize and experimentally characterize ancestral proteins (Lazarus, FireProt)
  • Directed evolution: Use ancestral sequences as starting points for protein engineering
  • Functional inference: Identify which mutations caused functional changes
  • Vaccine design: Reconstruct ancestral viral proteins as broadly-reactive immunogens

4.1 — dN/dS (ω) Analysis Fundamentals

ω ValueInterpretationBiological Meaning
ω << 1Strong purifying selectionProtein function is highly conserved
ω ≈ 1Neutral evolutionNo selective constraint (pseudogenes, junk)
ω > 1Positive selectionAmino acid changes are favored (adaptation)
ω = 0.1–0.3Typical for most genesStrong constraint with some flexibility
Bash — Pairwise dN/dS
# PAML yn00: pairwise dN/dS estimation
# yn00.ctl:
  seqfile = codon_alignment.phy
  outfile = yn00_results
  verbose = 0
  icode = 0       * 0=universal genetic code
  weession = 0

yn00 yn00.ctl
# Output: matrix of pairwise dN, dS, and omega values

# KaKs_Calculator (alternative, faster)
KaKs_Calculator -i pairs.axt -o results.kaks -m YN

# Sliding window dN/dS analysis
# Use: DnaSP, or custom R script with seqinr
# Reveals which regions of a gene are under different selection pressures
Thumb Rule — dN/dS Caveats: Gene-wide ω rarely exceeds 1, even in genes under strong positive selection. Why? Most sites are conserved (ω<1), which averages down the gene-wide ω. To detect positive selection, you need site models (Section 4.3) that allow ω to vary across sites.

4.2 — Branch Models & Clade Models (PAML)

Bash — PAML codeml Branch Models
# Label foreground branches in your tree file with #1
# tree: ((A #1, B), (C, D));
# #1 marks the branch leading to A as foreground

# codeml.ctl for two-ratio model:
  seqfile = codon_alignment.phy
  treefile = labeled_tree.nwk
  outfile = branch_results
  noisy = 3
  runmode = 0
  seqtype = 1
  CodonFreq = 2
  model = 2      * 2 = two-ratio (foreground vs background)
  NSsites = 0
  fix_omega = 0
  omega = 1.5    * initial value

codeml codeml.ctl

# Also run null model (model = 0, one ratio for all branches)
# LRT: 2*(lnL_alt - lnL_null) ~ chi²(df=1)
# Significant → foreground lineage has different omega

# Free-ratio model (each branch gets its own omega)
# model = 1
# Warning: over-parameterized for large trees, use for exploration only

# Clade models (CmC, CmD): allow sites to switch selective regimes
# between clades. More powerful than branch models for detecting
# clade-specific selection shifts at specific sites.
Download:

4.3 — Site Models & Detecting Positive Selection

Bash — PAML codeml Site Models
# codeml.ctl for site models:
  seqfile = codon_alignment.phy
  treefile = tree.nwk
  outfile = site_results
  noisy = 3
  runmode = 0
  seqtype = 1
  CodonFreq = 2
  model = 0       * 0 = same tree for all
  NSsites = 0 1 2 7 8  * run M0, M1a, M2a, M7, M8
  fix_omega = 0
  omega = 1.5

codeml codeml.ctl

# Key comparisons (LRT):
# M1a (nearly neutral) vs M2a (selection): tests for ω > 1 sites
#   df = 2, critical chi² = 5.99 (p=0.05)
# M7 (beta) vs M8 (beta + ω>1): more sensitive test
#   df = 2, critical chi² = 5.99
# M8a (beta + ω=1) vs M8: stricter null
#   df = 1, critical chi² = 3.84

# If M2a or M8 is significant:
# Check BEB (Bayes Empirical Bayes) output for positively selected sites
# Sites with P(ω>1) > 0.95 are candidates
Download:

HyPhy — Modern Selection Analysis

Bash — HyPhy Suite
# BUSTED: Branch-site Unrestricted Statistical Test for Episodic Diversification
# Tests if gene has experienced positive selection on ANY branch
hyphy busted --alignment codon_aln.fasta --tree tree.nwk

# FEL: Fixed Effects Likelihood (pervasive selection per site)
hyphy fel --alignment codon_aln.fasta --tree tree.nwk

# MEME: Mixed Effects Model of Evolution (episodic selection per site)
# More powerful than FEL — detects selection at specific sites on specific branches
hyphy meme --alignment codon_aln.fasta --tree tree.nwk

# FUBAR: Fast Unconstrained Bayesian AppRoximation (Bayesian site-level)
hyphy fubar --alignment codon_aln.fasta --tree tree.nwk

# aBSREL: adaptive Branch-Site REL (selection on each branch)
hyphy absrel --alignment codon_aln.fasta --tree tree.nwk

# RELAX: test for relaxation or intensification of selection
hyphy relax --alignment codon_aln.fasta --tree labeled_tree.nwk \
  --test Foreground

# Results in JSON format — use HyPhy Vision (vision.hyphy.org) to visualize
Download:
MethodDetectsScopeStatistical Power
PAML M2a/M8Pervasive site-level positive selectionAcross entire treeModerate (conservative)
MEMEEpisodic site-level selectionSpecific branches + sitesHigh (most sensitive)
BUSTEDGene-wide positive selectionGene on any branchHigh
aBSRELBranch-specific positive selectionEach branch testedModerate
FUBARPervasive site-level (Bayesian)Across entire treeHigh for larger datasets
FELPervasive site-level (frequentist)Across entire treeLower for small datasets
Mnemonic

"MEME for Moments, FEL for Forever"

MEME detects episodic selection (moments in time — specific branches). FEL detects pervasive selection (forever — across the whole tree). Use both: MEME finds what FEL misses.

4.4 — Branch-Site Models

Bash — PAML Branch-Site Model A
# The most popular test for episodic positive selection
# Tests: did specific sites experience ω>1 on a specific lineage?

# Label foreground branch in tree: ((A #1, B), (C, D));

# Alternative model (model A):
# codeml.ctl:
  model = 2
  NSsites = 2
  fix_omega = 0
  omega = 1.5

# Null model (fix ω₂ = 1):
  model = 2
  NSsites = 2
  fix_omega = 1
  omega = 1        * fixed at 1

# LRT: 2*(lnL_alt - lnL_null) ~ chi²(df=1), or mix(χ²₀, χ²₁)
# Critical value: 2.71 (5% with one-sided test on boundary)

# If significant: check BEB for sites with P(class 2a) > 0.95
# These sites experienced ω>1 specifically on the foreground branch

# Common pitfall: testing too many foreground branches
# Correct for multiple testing (Bonferroni or FDR)
Download:

4.5 — Codon Usage Bias & Translational Selection

Bash
# CodonW: classic tool for codon usage analysis
codonw sequences.fasta -all -noblk -nomenu
# Output: ENC (effective number of codons), GC3, CAI, RSCU

# ENC (Nc): ranges from 20 (max bias) to 61 (no bias)
# CAI: Codon Adaptation Index (0-1, higher = more adapted to tRNA pool)
# RSCU: Relative Synonymous Codon Usage (>1 = preferred, <1 = avoided)

# DAMBE: GUI tool for comprehensive codon usage analysis
# Includes: ENC plot (ENC vs GC3), correspondence analysis, Ina's test

# Distinguishing mutational bias from translational selection:
# Plot ENC vs GC3s (Wright 1990)
# Points below the expected curve → translational selection
# Points on the curve → purely mutational bias

4.7 — Convergent Molecular Evolution

R — RERconverge
library(RERconverge)

# RERconverge: detect genes with convergent rate shifts
# associated with a phenotype

# Input: gene trees (one per gene) + species tree + binary trait
trees <- readTrees("gene_trees_dir/")
trait <- setNames(c(1,1,0,0,0,1,0), tree$tip.label)
# 1 = convergent phenotype (e.g., echolocation in bats & dolphins)

# Compute relative evolutionary rates
RERmat <- getAllResiduals(trees, transform="sqrt", weighted=TRUE)

# Correlate rates with phenotype
results <- correlateWithBinaryPhenotype(RERmat, trait, min.sp=10)

# Top genes: convergently accelerated in phenotype-bearing lineages
head(results[order(results$p.adj),], 20)

# PCOC: Profile Changes in Orthologous Clusters
# Detects convergent amino acid substitutions
# pcoc -t tree.nwk -a alignment.fasta -o results/
Download:

AI Prompt Guide

Character Evolution & Selection Analysis
I have a dated phylogeny of [NUMBER] [taxon type] and want to: PART A — Ancestral state reconstruction: - Trait type: [discrete: specify states / continuous: specify units] - I want to test [correlated evolution / adaptive shift / biogeographic reconstruction] - Tools preferred: [phytools / corHMM / BioGeoBEARS / OUwie] PART B — Selection analysis: - I have codon alignments for [NUMBER] genes - Foreground lineage: [SPECIES/CLADE] - I want to test for [episodic positive selection / relaxation / convergent evolution] - Run both PAML branch-site and HyPhy (BUSTED, MEME, aBSREL) Please write complete R and Bash scripts for both parts, including: 1. Model fitting and comparison (AIC or LRT) 2. Proper null model specification for each test 3. Multiple testing correction 4. Publication-quality visualization 5. Interpretation guidelines for each result

Troubleshooting

PAML codeml: omega stuck at boundary (0 or 999)

The optimization is getting stuck. Solutions: (1) Try different initial omega values (0.5, 1.5, 3.0), (2) Check alignment for stop codons or frameshifts, (3) Ensure alignment is in-frame codons, (4) Remove sequences with many gaps. Use cleandata = 1 to remove ambiguous sites.

HyPhy: "This alignment does not appear to be a coding alignment"

HyPhy expects in-frame codon alignment. Check: (1) Alignment length is divisible by 3, (2) No internal stop codons, (3) Sequences start at codon position 1. Use MACSE or PAL2NAL to create proper codon alignments from protein alignments.

BioGeoBEARS: "Error in optim"

Common when areas are too numerous (>7 areas, 2^7 = 128 ranges). Solutions: (1) Reduce number of areas by lumping adjacent regions, (2) Set max_range_size to 2 or 3, (3) Use adjacency constraints to prevent biologically implausible ranges.

OUwie: "theta outside bounds" or convergence failures

Multi-regime OU models are notoriously finicky. Solutions: (1) Ensure your mapped regimes are correct (use SIMMAP), (2) Try simpler models first (BMS before OUMV), (3) Increase opts$iterations, (4) Check that trait values are on the right scale (log-transform if needed).

Mnemonics & Thumb Rules

Mnemonic

"ω < 1 = Purifying, ω = 1 = Neutral, ω > 1 = Positive" — PNP

The most fundamental equation in molecular evolution. Most genes have ω ≈ 0.1–0.3 (strong purifying selection). ω > 1 at specific sites = positive selection driving amino acid change.

Mnemonic

"M7 vs M8 = the Money test"

The M7 (beta) vs M8 (beta+ω>1) comparison in PAML is the most widely used test for positive selection. If significant, look at the BEB output for sites with posterior probability >0.95.

Thumb Rule — Alignment Quality for dN/dS: dN/dS analysis is extremely sensitive to alignment errors. A single misaligned codon can create a false signal of positive selection. Always: (1) align proteins first, then back-translate to codons (PAL2NAL), (2) remove poorly aligned regions (Gblocks/trimAl), (3) check for recombination (GARD in HyPhy) before running selection tests.
Thumb Rule — Lambda & K: Pagel's λ close to 1 = strong phylogenetic signal (use PGLS). Blomberg's K > 1 = trait is more conserved than BM expectation. K < 1 = convergent evolution or measurement error. Always report both.