Divergence Time Estimation — Module 3

Molecular Clocks · Fossil Calibration Strategy · Node Dating with BEAST2 · Tip Dating & Total Evidence · MCMCTree & treePL · Rate Variation

~140 min Advanced Bash / R / XML
Byte

2.1 — Molecular Clock Theory

The molecular clock hypothesis assumes that DNA/protein sequences accumulate substitutions at a roughly constant rate over time, allowing divergence dates to be estimated from sequence differences.

Clock ModelAssumptionWhen AppropriateTool
Strict clockAll lineages evolve at the same rateVery closely related taxa (within-species, viruses)BEAST2, MrBayes
Uncorrelated lognormal (UCLN)Rates vary independently across branches from a lognormal distributionMost common choice for divergence datingBEAST2 (default)
Uncorrelated exponentialRates drawn from exponential distributionWhen rate variation is extremeBEAST2
Random local clockRate shifts at specific branchesWhen few lineages have dramatically different ratesBEAST2
Autocorrelated (CIR, UGAM)Parent and child branch rates are correlatedMammals and other groups with generation-time effectsMCMCTree, PhyloBayes

Testing for Clock-Likeness

R
library(ape)
library(phangorn)

# Load tree and alignment
tree <- read.tree("ml_tree.nwk")
alignment <- read.phyDat("alignment.phy", format="phylip")

# Likelihood ratio test: strict clock vs no clock
fit_noclock <- pml(tree, alignment, model="GTR+G(4)")
fit_noclock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
                          rearrangement="none")
fit_clock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
                        rearrangement="none", control=pml.control(
                          epsilon=1e-8, maxit=50, trace=0))

# Root-to-tip regression (for heterochronous data)
library(ape)
tree_rooted <- root(tree, outgroup="outgroup_taxon")
tip_dates <- c(TaxonA=2020, TaxonB=2019, TaxonC=2018)  # sampling dates
distances <- cophenetic.phylo(tree_rooted)
# Plot root-to-tip distance vs sampling date
# Linear relationship = clock-like behavior
# R² > 0.5 = reasonable clock signal

# TempEst (GUI tool) is the standard for root-to-tip regression
# for virus/pathogen phylogenetics
Download:
Mnemonic

"UCLN is the Universal Choice"

When in doubt, use the Uncorrelated Lognormal relaxed clock (UCLN). It's the most widely used and tested. Strict clock is only for viruses or very recent divergences. Autocorrelated clocks are better for mammals (generation time effect) but require careful testing.

2.2 — Fossil Calibration Strategy

Fossil calibrations are the most critical and error-prone step in divergence dating. The accuracy of your dates depends almost entirely on how well you choose and justify your calibration points.

Rules for Good Calibrations
  1. Fossils provide minimum ages, not exact ages. The oldest known fossil of a clade gives the minimum age of that clade. The true origin is older (possibly much older).
  2. Place the fossil at the correct node. A fossil must be assigned to the stem or crown of a clade based on its diagnostic characters. Misplacement = wrong dates everywhere.
  3. Use multiple independent calibrations (3–5 minimum). Never rely on a single calibration point.
  4. Use soft bounds, not hard bounds. Hard bounds (uniform priors) are overly restrictive. Lognormal or exponential priors allow some probability of the true age being older/younger than the fossil.
  5. Cross-validate. Remove each calibration one at a time and re-run. If removing one calibration drastically changes all dates, that calibration may be misplaced.
Prior TypeShapeWhen to UseBEAST2 Syntax
LognormalSkewed right from minimumMost calibrations — allows the true divergence to be somewhat older than the fossilLogNormal(mean, sigma, offset)
ExponentialSharp decay from minimumWhen the fossil record is dense and the oldest fossil is close to the true ageExponential(mean, offset)
UniformFlat between min–maxWhen both a minimum AND a reliable maximum age exist (rare)Uniform(lower, upper)
NormalSymmetric around meanSecondary calibrations from previous studies (not fossils)Normal(mean, sigma)
XML Snippet — BEAST2 Calibration Priors
<!-- Example calibration priors in BEAST2 XML -->

<!-- Calibration 1: Crown angiosperms, oldest fossil ~130 Ma -->
<distribution id="cal_angiosperms" spec="beast.math.distributions.MRCAPrior"
  tree="@Tree.t:alignment" monophyletic="true">
  <taxonset id="angiosperms" spec="TaxonSet">
    <taxon id="Arabidopsis" spec="Taxon"/>
    <taxon id="Oryza" spec="Taxon"/>
    <taxon id="Amborella" spec="Taxon"/>
  </taxonset>
  <!-- Lognormal: offset=130, mean=1.5, sigma=0.5 -->
  <!-- This places 95% CI at 130-175 Ma -->
  <LogNormal name="distr" offset="130.0" meanInRealSpace="true">
    <parameter name="M" value="1.5"/>
    <parameter name="S" value="0.5"/>
  </LogNormal>
</distribution>

<!-- Calibration 2: Crown mammals, oldest fossil ~66 Ma -->
<distribution id="cal_mammals" spec="beast.math.distributions.MRCAPrior"
  tree="@Tree.t:alignment" monophyletic="true">
  <taxonset id="mammals" spec="TaxonSet">
    <taxon id="Homo" spec="Taxon"/>
    <taxon id="Mus" spec="Taxon"/>
  </taxonset>
  <Exponential name="distr" offset="66.0">
    <parameter name="mean" value="10.0"/>
  </Exponential>
</distribution>
Download:

Calibration Resources

Fossil Calibration Database

Curated, peer-reviewed fossil calibrations (fossilcalibrations.org)

Database
Paleobiology Database (PBDB)

Fossil occurrence data with geological ages (paleobiodb.org)

Database
TimeTree

Published divergence times for many clades (timetree.org)

Secondary
Geologic Time Scale 2020

Official epoch/period boundaries for geological context

Reference
The #1 Calibration Mistake

Using the oldest known fossil as the exact age of the node (hard minimum = hard maximum). The fossil record is incomplete — the true origin is almost always older. Use lognormal or exponential priors with the fossil age as the offset (minimum), not the mean.

Thumb Rule — Number of Calibrations: Use 3–10 calibrations spread across the tree. More is generally better, but each must be well-justified. One bad calibration poisons all dates. Cross-validate by leaving-one-out and checking if estimates change dramatically.

2.3 — Node Dating with BEAST2

Full BEAST2 Workflow

Bash — Complete BEAST2 Pipeline
# Step 1: Prepare XML in BEAUti
# - Import alignment(s)
# - Partitions tab: set substitution models per partition
# - Clock tab: select Relaxed Clock Log Normal
# - Priors tab: Tree prior → Birth Death / Yule
# - Priors tab: Add MRCA priors (fossil calibrations)
# - MCMC tab: chain length 100,000,000, log every 10,000

# Step 2: Run BEAST2 on HPC
#!/bin/bash
#SBATCH --job-name=beast2
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=7-00:00:00
#SBATCH --output=beast2_%j.log

module load beast2/2.7.6
module load beagle  # GPU acceleration if available

# With BEAGLE (GPU or SSE acceleration):
beast -threads 16 -beagle_SSE -seed 12345 analysis.xml

# Resume from checkpoint (if wall time exceeded):
beast -resume -threads 16 analysis.xml

# Step 3: Assess convergence
# Open .log file in Tracer
# CHECK: ESS > 200 for ALL parameters
# CHECK: posterior, likelihood, tree height, clock rate

# Step 4: Combine runs (if running multiple independent chains)
logcombiner -log run1.log -log run2.log -o combined.log -burnin 10

# Step 5: Summarize tree distribution
treeannotator -heights median -burnin 10 \
  combined.trees dated_tree.nex

# Step 6: Visualize in FigTree
# Open dated_tree.nex → Display: Node bars (95% HPD)
# Export as SVG/PDF for publication
Download:
Tree PriorAssumptionWhen to Use
YulePure birth (no extinction)Species-level phylogenies, rapid diversification
Birth-DeathBirth + death (turnover)Most divergence dating studies (recommended default)
Coalescent ConstantConstant effective population sizePopulation-level data (within species)
Coalescent Bayesian SkylineNe changes over timeVirus/pathogen phylodynamics
Fossilized Birth-DeathBirth + death + fossil samplingTip dating with fossil taxa (Section 2.4)
Byte warns
BEAST2 MCMC tuning tips:
  • If acceptance rate <15% or >70% for any operator, adjust its weight/window size
  • For large datasets (>500 taxa): use partitioned analysis, BEAGLE GPU, and 108+ generations
  • Always run 2+ independent chains and check convergence between them
  • The ucldStdev parameter measures rate variation: if posterior median ≈ 0, a strict clock is sufficient

2.4 — Tip Dating & Total Evidence Dating

Instead of calibrating internal nodes, tip dating assigns ages directly to fossil taxa included in the analysis. Combined with molecular data from extant taxa and morphological data from fossils, this is total-evidence dating.

Fossilized Birth-Death (FBD) Process

BEAUti Workflow
# Total-evidence dating in BEAST2:
# 1. Install packages: SA (Sampled Ancestors), morph-models
# 2. Data: molecular alignment (extant) + morphological matrix (fossil+extant)
# 3. BEAUti setup:
#    - Import molecular partition (e.g., 5 genes)
#    - Import morphological partition (discrete characters)
#    - Tip dates: assign fossil ages (in Ma)
#      Tip Dates tab → check "Use tip dates"
#      Set each fossil taxon's age (e.g., 66.0 Ma)
#    - Model for morphology: Mk model (Lewis 2001)
#      Lewis Mk with gamma rate variation
#      Set "ascertainmentBias" = conditioning on variable characters
#    - Clock: relaxed log-normal for molecular AND morphological
#    - Tree prior: Fossilized Birth-Death
#      Parameters: diversification rate, turnover, sampling proportion
#    - MCMC: 200,000,000 generations (tip dating needs longer chains)

# Key FBD parameters:
# diversificationRate = speciation − extinction
# turnover = extinction / speciation
# samplingProportion = fossil sampling rate
# rho = extant sampling fraction (e.g., 0.1 if 10% of extant species sampled)
Sampled Ancestor Trees

In the FBD model, fossils can be sampled ancestors (direct ancestors of living taxa, sitting on branches) rather than tips. The SA package in BEAST2 enables this. If a fossil has zero-length branch to its descendant, it's a sampled ancestor. This is biologically realistic — some fossils really are ancestral to living species.

Thumb Rule — Total Evidence: Morphological data matrices should have ≥50 characters for reasonable clock estimation. The Mk model assumes all characters are equally weighted and unordered — use gamma rate variation (+G) to relax this. Always condition on variable characters (ascertainmentBias) because invariant morphological characters are never coded.

2.5 — Alternative Dating Approaches

MCMCTree (PAML) — Approximate Likelihood

Bash
# MCMCTree: fast Bayesian dating using approximate likelihood
# Much faster than BEAST2 for large phylogenomic datasets

# Step 1: Estimate branch lengths with BASEML (fixed topology)
# mcmctree.ctl:
  seqfile = alignment.phy
  treefile = tree_with_calibrations.nwk
  outfile = out
  ndata = 5          * number of partitions
  usedata = 3        * 3 = calculate gradient/Hessian for approx likelihood
  clock = 2          * 2 = independent rates (UCLN-like), 3 = autocorrelated
  model = 4          * 4 = HKY85
  BDparams = 1 1 0.1 * birth, death, sampling
  rgene_gamma = 2 20 * gamma prior on rates
  sigma2_gamma = 1 10 * gamma prior on rate variance

mcmctree mcmctree.ctl

# Step 2: Run MCMC with approximate likelihood
# Change usedata = 2 (use the Hessian from step 1)
# Re-run: mcmctree mcmctree.ctl

# Calibration format in tree file:
# (((A, B) '>0.66<1.30', C), D);
# '>0.66' = minimum 66 Ma, '<1.30' = maximum 130 Ma (in 100 Ma units)
# 'B(0.66, 1.30, 0.025, 0.025)' = soft bounds (recommended)
Download:

treePL — Penalized Likelihood

Bash
# treePL: fast penalized likelihood dating
# Good for large trees (1000+ taxa)

# Config file: treepl.cfg
treefile = ml_tree.nwk
smooth = 100          # smoothing parameter (cross-validate!)
numsites = 50000      # total alignment length
mrca = cal1 TaxonA TaxonB
min = cal1 66
max = cal1 130
mrca = cal2 TaxonC TaxonD
min = cal2 45
max = cal2 90

# Cross-validate smoothing parameter:
thorough
cvstart = 0.001
cvstop = 10000
cvmultstep = 10

# Run
treePL treepl.cfg > treepl_dated.nwk

# Alternative: RelTime in MEGA (very fast, approximate)
# GUI: load alignment + tree, set calibrations
# Good for exploratory analysis, not for final publication dates
Download:
ToolMethodSpeedAccuracyBest For
BEAST2Full Bayesian MCMCVery slowBest (with good calibrations)Final publication dates, <200 taxa
MCMCTreeApprox. likelihood BayesianFastGoodPhylogenomic datasets, 50–500 taxa
treePLPenalized likelihoodVery fastModerateLarge trees (1000+ taxa)
RelTimeRelative rate frameworkInstantApproximateExploratory, quick estimates
r8sPL / Langley-FitchFastModerateLegacy, small datasets

2.6 — Rate Variation Across Lineages

Evolutionary rates vary dramatically across lineages due to differences in generation time, metabolic rate, effective population size, DNA repair efficiency, and life history traits.

R — Visualizing Rate Variation
library(ape)
library(ggtree)
library(treeio)

# Load BEAST2 MCC tree (has rate annotations)
beast_tree <- read.beast("dated_tree.nex")

# Plot with branch rates as color
ggtree(beast_tree, aes(color=rate)) +
  scale_color_viridis_c(name="Substitution rate\n(subs/site/Myr)") +
  geom_tiplab(size=2) +
  theme_tree2() +
  ggtitle("Rate variation across lineages")

# Plot with node age bars (95% HPD)
ggtree(beast_tree, mrsd="2024-01-01") +
  geom_range("height_0.95_HPD", color="steelblue", alpha=0.3, size=2) +
  geom_tiplab(size=2) +
  theme_tree2() +
  scale_x_continuous(labels=abs) +
  xlab("Time (Ma)")

# Test for rate variation: coefficient of variation from BEAST2
# ucldStdev parameter in .log file
# Posterior median ucldStdev < 0.1 → strict clock sufficient
# ucldStdev > 0.5 → substantial rate variation
# ucldStdev > 1.0 → extreme rate variation
Download:
FactorEffect on RateExample
Generation timeShorter generations → faster evolution (per year)Mice evolve faster than elephants
Metabolic rateHigher metabolism → more mutagenic byproductsBirds vs. reptiles
Population size (Ne)Small Ne → more drift, slightly different rate dynamicsIsland vs. mainland populations
DNA repairBetter repair → slower rateUV-exposed organisms may evolve faster
Life historyAnnual vs. perennial plants differ systematicallyHerbaceous plants faster than trees
Mnemonic

"Generation time Governs rate" — GG

The generation time hypothesis explains most rate variation in animals and plants. Short-lived organisms copy their DNA more often per unit time = more mutations per year. This is why mice have ~2× the molecular rate of humans.

AI Prompt Guide

Divergence Time Estimation Pipeline
I need to date a phylogeny of [NUMBER] [taxa type, e.g., flowering plant genera]. I have [NUMBER] loci from [DATA TYPE: transcriptomes / UCEs / RADseq / whole genomes]. My ML tree is from [IQ-TREE / RAxML-NG]. Fossil calibrations I plan to use: - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) Please write a complete pipeline that: 1. Sets up BEAST2 XML with UCLN clock, Birth-Death tree prior 2. Specifies calibration priors as lognormal distributions 3. Includes proper MCMC settings for convergence 4. Provides a SLURM script for HPC 5. Also sets up MCMCTree as a cross-check (faster) 6. Includes R code to visualize the dated tree with 95% HPD bars 7. Advises on cross-validation of calibrations 8. Explains how to interpret ucldStdev for rate variation

Troubleshooting

BEAST2: Posterior and prior ESS are low despite long chains

Common issue with dating analyses. Solutions: (1) Increase chain length to 108 or 109, (2) Use BEAGLE for GPU acceleration, (3) Simplify: use fewer partitions or link clock models, (4) Try MCMCTree instead — its approximate likelihood converges faster.

BEAST2: Dates are unreasonably old or young

Almost always a calibration problem. Check: (1) Are calibration offsets in the right units (Ma, not years)? (2) Is the fossil placed at the correct node? (3) Are prior distributions too wide? Run without data (sampleFromPrior="true") and check if priors alone produce reasonable ages.

MCMCTree: "Hessian not positive definite"

The approximate likelihood failed. Causes: (1) Extremely short branch lengths (collapse near-zero branches), (2) Over-parameterized partition model. Try: reduce number of partitions, use simpler substitution model, or increase finetune values.

treePL: Cross-validation gives no clear optimal smoothing

The data may lack clear clock-like signal. Try: (1) Use a wider range of smoothing values, (2) Check if the tree has very uneven branch lengths (rate variation too extreme for PL), (3) Switch to BEAST2/MCMCTree which handle rate variation better.

Mnemonics & Thumb Rules

Mnemonic

"Fossils give Floors, not Ceilings" — FFC

Fossils provide minimum ages (floors). The true divergence is always older (possibly much older) because the fossil record is incomplete. Use lognormal priors with the fossil age as the offset (floor), not the mean.

Mnemonic

"BEAST for Best, MCMCTree for Many, treePL for Plenty"

BEAST2 = most accurate but slowest. MCMCTree = fast Bayesian for many loci. treePL = fastest for plenty of taxa. Choose based on dataset size and desired rigor.

Thumb Rule — Chain Length: BEAST2 dating analyses typically need 108–109 generations. That's 10–100× longer than a standard tree search. Budget HPC time accordingly (days to weeks).
Thumb Rule — ucldStdev: This BEAST2 parameter measures how much the clock deviates from strict. Posterior median <0.1 = strict clock is fine. 0.1–0.5 = moderate relaxation. >1.0 = extreme rate variation, relaxed clock essential.
Thumb Rule — Prior Sensitivity: Always run BEAST2 with sampleFromPrior="true" (no data) and compare to the posterior. If they look the same for a parameter, the data is not informing that parameter — your result is driven entirely by the prior.