Divergence Time Estimation — Module 3: Molecular Clocks, Fossil Calibration, BEAST2, Tip Dating

2.1 — Molecular Clock Theory

The molecular clock hypothesis assumes that DNA/protein sequences accumulate substitutions at a roughly constant rate over time, allowing divergence dates to be estimated from sequence differences.

Clock Model	Assumption	When Appropriate	Tool
Strict clock	All lineages evolve at the same rate	Very closely related taxa (within-species, viruses)	BEAST2, MrBayes
Uncorrelated lognormal (UCLN)	Rates vary independently across branches from a lognormal distribution	Most common choice for divergence dating	BEAST2 (default)
Uncorrelated exponential	Rates drawn from exponential distribution	When rate variation is extreme	BEAST2
Random local clock	Rate shifts at specific branches	When few lineages have dramatically different rates	BEAST2
Autocorrelated (CIR, UGAM)	Parent and child branch rates are correlated	Mammals and other groups with generation-time effects	MCMCTree, PhyloBayes

Testing for Clock-Likeness

library(ape)
library(phangorn)

# Load tree and alignment
tree <- read.tree("ml_tree.nwk")
alignment <- read.phyDat("alignment.phy", format="phylip")

# Likelihood ratio test: strict clock vs no clock
fit_noclock <- pml(tree, alignment, model="GTR+G(4)")
fit_noclock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
                          rearrangement="none")
fit_clock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
                        rearrangement="none", control=pml.control(
                          epsilon=1e-8, maxit=50, trace=0))

# Root-to-tip regression (for heterochronous data)
library(ape)
tree_rooted <- root(tree, outgroup="outgroup_taxon")
tip_dates <- c(TaxonA=2020, TaxonB=2019, TaxonC=2018)  # sampling dates
distances <- cophenetic.phylo(tree_rooted)
# Plot root-to-tip distance vs sampling date
# Linear relationship = clock-like behavior
# R² > 0.5 = reasonable clock signal

# TempEst (GUI tool) is the standard for root-to-tip regression
# for virus/pathogen phylogenetics

Download:

Mnemonic

"UCLN is the Universal Choice"

When in doubt, use the Uncorrelated Lognormal relaxed clock (UCLN). It's the most widely used and tested. Strict clock is only for viruses or very recent divergences. Autocorrelated clocks are better for mammals (generation time effect) but require careful testing.

2.2 — Fossil Calibration Strategy

Fossil calibrations are the most critical and error-prone step in divergence dating. The accuracy of your dates depends almost entirely on how well you choose and justify your calibration points.

Rules for Good Calibrations

Fossils provide minimum ages, not exact ages. The oldest known fossil of a clade gives the minimum age of that clade. The true origin is older (possibly much older).
Place the fossil at the correct node. A fossil must be assigned to the stem or crown of a clade based on its diagnostic characters. Misplacement = wrong dates everywhere.
Use multiple independent calibrations (3–5 minimum). Never rely on a single calibration point.
Use soft bounds, not hard bounds. Hard bounds (uniform priors) are overly restrictive. Lognormal or exponential priors allow some probability of the true age being older/younger than the fossil.
Cross-validate. Remove each calibration one at a time and re-run. If removing one calibration drastically changes all dates, that calibration may be misplaced.

Prior Type	Shape	When to Use	BEAST2 Syntax
Lognormal	Skewed right from minimum	Most calibrations — allows the true divergence to be somewhat older than the fossil	`LogNormal(mean, sigma, offset)`
Exponential	Sharp decay from minimum	When the fossil record is dense and the oldest fossil is close to the true age	`Exponential(mean, offset)`
Uniform	Flat between min–max	When both a minimum AND a reliable maximum age exist (rare)	`Uniform(lower, upper)`
Normal	Symmetric around mean	Secondary calibrations from previous studies (not fossils)	`Normal(mean, sigma)`

XML Snippet — BEAST2 Calibration Priors

<!-- Example calibration priors in BEAST2 XML -->

<!-- Calibration 1: Crown angiosperms, oldest fossil ~130 Ma -->
<distribution id="cal_angiosperms" spec="beast.math.distributions.MRCAPrior"
  tree="@Tree.t:alignment" monophyletic="true">
  <taxonset id="angiosperms" spec="TaxonSet">
    <taxon id="Arabidopsis" spec="Taxon"/>
    <taxon id="Oryza" spec="Taxon"/>
    <taxon id="Amborella" spec="Taxon"/>
  </taxonset>
  <!-- Lognormal: offset=130, mean=1.5, sigma=0.5 -->
  <!-- This places 95% CI at 130-175 Ma -->
  <LogNormal name="distr" offset="130.0" meanInRealSpace="true">
    <parameter name="M" value="1.5"/>
    <parameter name="S" value="0.5"/>
  </LogNormal>
</distribution>

<!-- Calibration 2: Crown mammals, oldest fossil ~66 Ma -->
<distribution id="cal_mammals" spec="beast.math.distributions.MRCAPrior"
  tree="@Tree.t:alignment" monophyletic="true">
  <taxonset id="mammals" spec="TaxonSet">
    <taxon id="Homo" spec="Taxon"/>
    <taxon id="Mus" spec="Taxon"/>
  </taxonset>
  <Exponential name="distr" offset="66.0">
    <parameter name="mean" value="10.0"/>
  </Exponential>
</distribution>

Download:

Calibration Resources

Fossil Calibration Database

Curated, peer-reviewed fossil calibrations (fossilcalibrations.org)

Database

Paleobiology Database (PBDB)

Fossil occurrence data with geological ages (paleobiodb.org)

Database

TimeTree

Published divergence times for many clades (timetree.org)

Secondary

Geologic Time Scale 2020

Official epoch/period boundaries for geological context

Reference

The #1 Calibration Mistake

Using the oldest known fossil as the exact age of the node (hard minimum = hard maximum). The fossil record is incomplete — the true origin is almost always older. Use lognormal or exponential priors with the fossil age as the offset (minimum), not the mean.

Thumb Rule — Number of Calibrations: Use 3–10 calibrations spread across the tree. More is generally better, but each must be well-justified. One bad calibration poisons all dates. Cross-validate by leaving-one-out and checking if estimates change dramatically.

2.3 — Node Dating with BEAST2

Full BEAST2 Workflow

Bash — Complete BEAST2 Pipeline

# Step 1: Prepare XML in BEAUti
# - Import alignment(s)
# - Partitions tab: set substitution models per partition
# - Clock tab: select Relaxed Clock Log Normal
# - Priors tab: Tree prior → Birth Death / Yule
# - Priors tab: Add MRCA priors (fossil calibrations)
# - MCMC tab: chain length 100,000,000, log every 10,000

# Step 2: Run BEAST2 on HPC
#!/bin/bash
#SBATCH --job-name=beast2
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=7-00:00:00
#SBATCH --output=beast2_%j.log

module load beast2/2.7.6
module load beagle  # GPU acceleration if available

# With BEAGLE (GPU or SSE acceleration):
beast -threads 16 -beagle_SSE -seed 12345 analysis.xml

# Resume from checkpoint (if wall time exceeded):
beast -resume -threads 16 analysis.xml

# Step 3: Assess convergence
# Open .log file in Tracer
# CHECK: ESS > 200 for ALL parameters
# CHECK: posterior, likelihood, tree height, clock rate

# Step 4: Combine runs (if running multiple independent chains)
logcombiner -log run1.log -log run2.log -o combined.log -burnin 10

# Step 5: Summarize tree distribution
treeannotator -heights median -burnin 10 \
  combined.trees dated_tree.nex

# Step 6: Visualize in FigTree
# Open dated_tree.nex → Display: Node bars (95% HPD)
# Export as SVG/PDF for publication

Download:

Tree Prior	Assumption	When to Use
Yule	Pure birth (no extinction)	Species-level phylogenies, rapid diversification
Birth-Death	Birth + death (turnover)	Most divergence dating studies (recommended default)
Coalescent Constant	Constant effective population size	Population-level data (within species)
Coalescent Bayesian Skyline	Ne changes over time	Virus/pathogen phylodynamics
Fossilized Birth-Death	Birth + death + fossil sampling	Tip dating with fossil taxa (Section 2.4)

BEAST2 MCMC tuning tips:

If acceptance rate <15% or >70% for any operator, adjust its weight/window size
For large datasets (>500 taxa): use partitioned analysis, BEAGLE GPU, and 10⁸+ generations
Always run 2+ independent chains and check convergence between them
The ucldStdev parameter measures rate variation: if posterior median ≈ 0, a strict clock is sufficient

2.4 — Tip Dating & Total Evidence Dating

Instead of calibrating internal nodes, tip dating assigns ages directly to fossil taxa included in the analysis. Combined with molecular data from extant taxa and morphological data from fossils, this is total-evidence dating.

Fossilized Birth-Death (FBD) Process

BEAUti Workflow

# Total-evidence dating in BEAST2:
# 1. Install packages: SA (Sampled Ancestors), morph-models
# 2. Data: molecular alignment (extant) + morphological matrix (fossil+extant)
# 3. BEAUti setup:
#    - Import molecular partition (e.g., 5 genes)
#    - Import morphological partition (discrete characters)
#    - Tip dates: assign fossil ages (in Ma)
#      Tip Dates tab → check "Use tip dates"
#      Set each fossil taxon's age (e.g., 66.0 Ma)
#    - Model for morphology: Mk model (Lewis 2001)
#      Lewis Mk with gamma rate variation
#      Set "ascertainmentBias" = conditioning on variable characters
#    - Clock: relaxed log-normal for molecular AND morphological
#    - Tree prior: Fossilized Birth-Death
#      Parameters: diversification rate, turnover, sampling proportion
#    - MCMC: 200,000,000 generations (tip dating needs longer chains)

# Key FBD parameters:
# diversificationRate = speciation − extinction
# turnover = extinction / speciation
# samplingProportion = fossil sampling rate
# rho = extant sampling fraction (e.g., 0.1 if 10% of extant species sampled)

Sampled Ancestor Trees

In the FBD model, fossils can be sampled ancestors (direct ancestors of living taxa, sitting on branches) rather than tips. The SA package in BEAST2 enables this. If a fossil has zero-length branch to its descendant, it's a sampled ancestor. This is biologically realistic — some fossils really are ancestral to living species.

Thumb Rule — Total Evidence: Morphological data matrices should have ≥50 characters for reasonable clock estimation. The Mk model assumes all characters are equally weighted and unordered — use gamma rate variation (+G) to relax this. Always condition on variable characters (ascertainmentBias) because invariant morphological characters are never coded.

2.5 — Alternative Dating Approaches

MCMCTree (PAML) — Approximate Likelihood

Bash

# MCMCTree: fast Bayesian dating using approximate likelihood
# Much faster than BEAST2 for large phylogenomic datasets

# Step 1: Estimate branch lengths with BASEML (fixed topology)
# mcmctree.ctl:
  seqfile = alignment.phy
  treefile = tree_with_calibrations.nwk
  outfile = out
  ndata = 5          * number of partitions
  usedata = 3        * 3 = calculate gradient/Hessian for approx likelihood
  clock = 2          * 2 = independent rates (UCLN-like), 3 = autocorrelated
  model = 4          * 4 = HKY85
  BDparams = 1 1 0.1 * birth, death, sampling
  rgene_gamma = 2 20 * gamma prior on rates
  sigma2_gamma = 1 10 * gamma prior on rate variance

mcmctree mcmctree.ctl

# Step 2: Run MCMC with approximate likelihood
# Change usedata = 2 (use the Hessian from step 1)
# Re-run: mcmctree mcmctree.ctl

# Calibration format in tree file:
# (((A, B) '>0.66<1.30', C), D);
# '>0.66' = minimum 66 Ma, '<1.30' = maximum 130 Ma (in 100 Ma units)
# 'B(0.66, 1.30, 0.025, 0.025)' = soft bounds (recommended)

Download:

treePL — Penalized Likelihood

Bash

# treePL: fast penalized likelihood dating
# Good for large trees (1000+ taxa)

# Config file: treepl.cfg
treefile = ml_tree.nwk
smooth = 100          # smoothing parameter (cross-validate!)
numsites = 50000      # total alignment length
mrca = cal1 TaxonA TaxonB
min = cal1 66
max = cal1 130
mrca = cal2 TaxonC TaxonD
min = cal2 45
max = cal2 90

# Cross-validate smoothing parameter:
thorough
cvstart = 0.001
cvstop = 10000
cvmultstep = 10

# Run
treePL treepl.cfg > treepl_dated.nwk

# Alternative: RelTime in MEGA (very fast, approximate)
# GUI: load alignment + tree, set calibrations
# Good for exploratory analysis, not for final publication dates

Download:

Tool	Method	Speed	Accuracy	Best For
BEAST2	Full Bayesian MCMC	Very slow	Best (with good calibrations)	Final publication dates, <200 taxa
MCMCTree	Approx. likelihood Bayesian	Fast	Good	Phylogenomic datasets, 50–500 taxa
treePL	Penalized likelihood	Very fast	Moderate	Large trees (1000+ taxa)
RelTime	Relative rate framework	Instant	Approximate	Exploratory, quick estimates
r8s	PL / Langley-Fitch	Fast	Moderate	Legacy, small datasets

2.6 — Rate Variation Across Lineages

Evolutionary rates vary dramatically across lineages due to differences in generation time, metabolic rate, effective population size, DNA repair efficiency, and life history traits.

R — Visualizing Rate Variation

library(ape)
library(ggtree)
library(treeio)

# Load BEAST2 MCC tree (has rate annotations)
beast_tree <- read.beast("dated_tree.nex")

# Plot with branch rates as color
ggtree(beast_tree, aes(color=rate)) +
  scale_color_viridis_c(name="Substitution rate\n(subs/site/Myr)") +
  geom_tiplab(size=2) +
  theme_tree2() +
  ggtitle("Rate variation across lineages")

# Plot with node age bars (95% HPD)
ggtree(beast_tree, mrsd="2024-01-01") +
  geom_range("height_0.95_HPD", color="steelblue", alpha=0.3, size=2) +
  geom_tiplab(size=2) +
  theme_tree2() +
  scale_x_continuous(labels=abs) +
  xlab("Time (Ma)")

# Test for rate variation: coefficient of variation from BEAST2
# ucldStdev parameter in .log file
# Posterior median ucldStdev < 0.1 → strict clock sufficient
# ucldStdev > 0.5 → substantial rate variation
# ucldStdev > 1.0 → extreme rate variation

Download:

Factor	Effect on Rate	Example
Generation time	Shorter generations → faster evolution (per year)	Mice evolve faster than elephants
Metabolic rate	Higher metabolism → more mutagenic byproducts	Birds vs. reptiles
Population size (Ne)	Small Ne → more drift, slightly different rate dynamics	Island vs. mainland populations
DNA repair	Better repair → slower rate	UV-exposed organisms may evolve faster
Life history	Annual vs. perennial plants differ systematically	Herbaceous plants faster than trees

Mnemonic

"Generation time Governs rate" — GG

The generation time hypothesis explains most rate variation in animals and plants. Short-lived organisms copy their DNA more often per unit time = more mutations per year. This is why mice have ~2× the molecular rate of humans.

AI Prompt Guide

Divergence Time Estimation Pipeline

I need to date a phylogeny of [NUMBER] [taxa type, e.g., flowering plant genera]. I have [NUMBER] loci from [DATA TYPE: transcriptomes / UCEs / RADseq / whole genomes]. My ML tree is from [IQ-TREE / RAxML-NG]. Fossil calibrations I plan to use: - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) - [NODE]: minimum [AGE] Ma, from [FOSSIL NAME] ([REFERENCE]) Please write a complete pipeline that: 1. Sets up BEAST2 XML with UCLN clock, Birth-Death tree prior 2. Specifies calibration priors as lognormal distributions 3. Includes proper MCMC settings for convergence 4. Provides a SLURM script for HPC 5. Also sets up MCMCTree as a cross-check (faster) 6. Includes R code to visualize the dated tree with 95% HPD bars 7. Advises on cross-validation of calibrations 8. Explains how to interpret ucldStdev for rate variation

⚠ Troubleshooting

BEAST2: Posterior and prior ESS are low despite long chains

Common issue with dating analyses. Solutions: (1) Increase chain length to 10⁸ or 10⁹, (2) Use BEAGLE for GPU acceleration, (3) Simplify: use fewer partitions or link clock models, (4) Try MCMCTree instead — its approximate likelihood converges faster.

BEAST2: Dates are unreasonably old or young

Almost always a calibration problem. Check: (1) Are calibration offsets in the right units (Ma, not years)? (2) Is the fossil placed at the correct node? (3) Are prior distributions too wide? Run without data (sampleFromPrior="true") and check if priors alone produce reasonable ages.

MCMCTree: "Hessian not positive definite"

The approximate likelihood failed. Causes: (1) Extremely short branch lengths (collapse near-zero branches), (2) Over-parameterized partition model. Try: reduce number of partitions, use simpler substitution model, or increase finetune values.

treePL: Cross-validation gives no clear optimal smoothing

The data may lack clear clock-like signal. Try: (1) Use a wider range of smoothing values, (2) Check if the tree has very uneven branch lengths (rate variation too extreme for PL), (3) Switch to BEAST2/MCMCTree which handle rate variation better.

Mnemonics & Thumb Rules

Mnemonic

"Fossils give Floors, not Ceilings" — FFC

Fossils provide minimum ages (floors). The true divergence is always older (possibly much older) because the fossil record is incomplete. Use lognormal priors with the fossil age as the offset (floor), not the mean.

Mnemonic

"BEAST for Best, MCMCTree for Many, treePL for Plenty"

BEAST2 = most accurate but slowest. MCMCTree = fast Bayesian for many loci. treePL = fastest for plenty of taxa. Choose based on dataset size and desired rigor.

Thumb Rule — Chain Length: BEAST2 dating analyses typically need 10⁸–10⁹ generations. That's 10–100× longer than a standard tree search. Budget HPC time accordingly (days to weeks).

Thumb Rule — ucldStdev: This BEAST2 parameter measures how much the clock deviates from strict. Posterior median <0.1 = strict clock is fine. 0.1–0.5 = moderate relaxation. >1.0 = extreme rate variation, relaxed clock essential.

Thumb Rule — Prior Sensitivity: Always run BEAST2 with sampleFromPrior="true" (no data) and compare to the posterior. If they look the same for a parameter, the data is not informing that parameter — your result is driven entirely by the prior.

Module 2: Trees & Networks Module 4: Ancestral States & Selection

Divergence Time Estimation — Module 3

On this page

2.1 — Molecular Clock Theory

Testing for Clock-Likeness

2.2 — Fossil Calibration Strategy

Rules for Good Calibrations

Calibration Resources

Fossil Calibration Database

Paleobiology Database (PBDB)

TimeTree

Geologic Time Scale 2020

The #1 Calibration Mistake

2.3 — Node Dating with BEAST2

Full BEAST2 Workflow

2.4 — Tip Dating & Total Evidence Dating

Fossilized Birth-Death (FBD) Process

Sampled Ancestor Trees

2.5 — Alternative Dating Approaches

MCMCTree (PAML) — Approximate Likelihood

treePL — Penalized Likelihood

2.6 — Rate Variation Across Lineages

AI Prompt Guide

Divergence Time Estimation Pipeline

⚠ Troubleshooting

BEAST2: Posterior and prior ESS are low despite long chains

BEAST2: Dates are unreasonably old or young

MCMCTree: "Hessian not positive definite"

treePL: Cross-validation gives no clear optimal smoothing

Mnemonics & Thumb Rules