Molecular Clocks · Fossil Calibration Strategy · Node Dating with BEAST2 · Tip Dating & Total Evidence · MCMCTree & treePL · Rate Variation
The molecular clock hypothesis assumes that DNA/protein sequences accumulate substitutions at a roughly constant rate over time, allowing divergence dates to be estimated from sequence differences.
| Clock Model | Assumption | When Appropriate | Tool |
|---|---|---|---|
| Strict clock | All lineages evolve at the same rate | Very closely related taxa (within-species, viruses) | BEAST2, MrBayes |
| Uncorrelated lognormal (UCLN) | Rates vary independently across branches from a lognormal distribution | Most common choice for divergence dating | BEAST2 (default) |
| Uncorrelated exponential | Rates drawn from exponential distribution | When rate variation is extreme | BEAST2 |
| Random local clock | Rate shifts at specific branches | When few lineages have dramatically different rates | BEAST2 |
| Autocorrelated (CIR, UGAM) | Parent and child branch rates are correlated | Mammals and other groups with generation-time effects | MCMCTree, PhyloBayes |
library(ape)
library(phangorn)
# Load tree and alignment
tree <- read.tree("ml_tree.nwk")
alignment <- read.phyDat("alignment.phy", format="phylip")
# Likelihood ratio test: strict clock vs no clock
fit_noclock <- pml(tree, alignment, model="GTR+G(4)")
fit_noclock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
rearrangement="none")
fit_clock <- optim.pml(fit_noclock, model="GTR", optGamma=TRUE,
rearrangement="none", control=pml.control(
epsilon=1e-8, maxit=50, trace=0))
# Root-to-tip regression (for heterochronous data)
library(ape)
tree_rooted <- root(tree, outgroup="outgroup_taxon")
tip_dates <- c(TaxonA=2020, TaxonB=2019, TaxonC=2018) # sampling dates
distances <- cophenetic.phylo(tree_rooted)
# Plot root-to-tip distance vs sampling date
# Linear relationship = clock-like behavior
# R² > 0.5 = reasonable clock signal
# TempEst (GUI tool) is the standard for root-to-tip regression
# for virus/pathogen phylogenetics
"UCLN is the Universal Choice"
When in doubt, use the Uncorrelated Lognormal relaxed clock (UCLN). It's the most widely used and tested. Strict clock is only for viruses or very recent divergences. Autocorrelated clocks are better for mammals (generation time effect) but require careful testing.
Fossil calibrations are the most critical and error-prone step in divergence dating. The accuracy of your dates depends almost entirely on how well you choose and justify your calibration points.
| Prior Type | Shape | When to Use | BEAST2 Syntax |
|---|---|---|---|
| Lognormal | Skewed right from minimum | Most calibrations — allows the true divergence to be somewhat older than the fossil | LogNormal(mean, sigma, offset) |
| Exponential | Sharp decay from minimum | When the fossil record is dense and the oldest fossil is close to the true age | Exponential(mean, offset) |
| Uniform | Flat between min–max | When both a minimum AND a reliable maximum age exist (rare) | Uniform(lower, upper) |
| Normal | Symmetric around mean | Secondary calibrations from previous studies (not fossils) | Normal(mean, sigma) |
<!-- Example calibration priors in BEAST2 XML -->
<!-- Calibration 1: Crown angiosperms, oldest fossil ~130 Ma -->
<distribution id="cal_angiosperms" spec="beast.math.distributions.MRCAPrior"
tree="@Tree.t:alignment" monophyletic="true">
<taxonset id="angiosperms" spec="TaxonSet">
<taxon id="Arabidopsis" spec="Taxon"/>
<taxon id="Oryza" spec="Taxon"/>
<taxon id="Amborella" spec="Taxon"/>
</taxonset>
<!-- Lognormal: offset=130, mean=1.5, sigma=0.5 -->
<!-- This places 95% CI at 130-175 Ma -->
<LogNormal name="distr" offset="130.0" meanInRealSpace="true">
<parameter name="M" value="1.5"/>
<parameter name="S" value="0.5"/>
</LogNormal>
</distribution>
<!-- Calibration 2: Crown mammals, oldest fossil ~66 Ma -->
<distribution id="cal_mammals" spec="beast.math.distributions.MRCAPrior"
tree="@Tree.t:alignment" monophyletic="true">
<taxonset id="mammals" spec="TaxonSet">
<taxon id="Homo" spec="Taxon"/>
<taxon id="Mus" spec="Taxon"/>
</taxonset>
<Exponential name="distr" offset="66.0">
<parameter name="mean" value="10.0"/>
</Exponential>
</distribution>
Curated, peer-reviewed fossil calibrations (fossilcalibrations.org)
DatabaseFossil occurrence data with geological ages (paleobiodb.org)
DatabasePublished divergence times for many clades (timetree.org)
SecondaryOfficial epoch/period boundaries for geological context
ReferenceUsing the oldest known fossil as the exact age of the node (hard minimum = hard maximum). The fossil record is incomplete — the true origin is almost always older. Use lognormal or exponential priors with the fossil age as the offset (minimum), not the mean.
# Step 1: Prepare XML in BEAUti # - Import alignment(s) # - Partitions tab: set substitution models per partition # - Clock tab: select Relaxed Clock Log Normal # - Priors tab: Tree prior → Birth Death / Yule # - Priors tab: Add MRCA priors (fossil calibrations) # - MCMC tab: chain length 100,000,000, log every 10,000 # Step 2: Run BEAST2 on HPC #!/bin/bash #SBATCH --job-name=beast2 #SBATCH --cpus-per-task=16 #SBATCH --mem=64G #SBATCH --time=7-00:00:00 #SBATCH --output=beast2_%j.log module load beast2/2.7.6 module load beagle # GPU acceleration if available # With BEAGLE (GPU or SSE acceleration): beast -threads 16 -beagle_SSE -seed 12345 analysis.xml # Resume from checkpoint (if wall time exceeded): beast -resume -threads 16 analysis.xml # Step 3: Assess convergence # Open .log file in Tracer # CHECK: ESS > 200 for ALL parameters # CHECK: posterior, likelihood, tree height, clock rate # Step 4: Combine runs (if running multiple independent chains) logcombiner -log run1.log -log run2.log -o combined.log -burnin 10 # Step 5: Summarize tree distribution treeannotator -heights median -burnin 10 \ combined.trees dated_tree.nex # Step 6: Visualize in FigTree # Open dated_tree.nex → Display: Node bars (95% HPD) # Export as SVG/PDF for publication
| Tree Prior | Assumption | When to Use |
|---|---|---|
| Yule | Pure birth (no extinction) | Species-level phylogenies, rapid diversification |
| Birth-Death | Birth + death (turnover) | Most divergence dating studies (recommended default) |
| Coalescent Constant | Constant effective population size | Population-level data (within species) |
| Coalescent Bayesian Skyline | Ne changes over time | Virus/pathogen phylodynamics |
| Fossilized Birth-Death | Birth + death + fossil sampling | Tip dating with fossil taxa (Section 2.4) |
ucldStdev parameter measures rate variation: if posterior median ≈ 0, a strict clock is sufficientInstead of calibrating internal nodes, tip dating assigns ages directly to fossil taxa included in the analysis. Combined with molecular data from extant taxa and morphological data from fossils, this is total-evidence dating.
# Total-evidence dating in BEAST2: # 1. Install packages: SA (Sampled Ancestors), morph-models # 2. Data: molecular alignment (extant) + morphological matrix (fossil+extant) # 3. BEAUti setup: # - Import molecular partition (e.g., 5 genes) # - Import morphological partition (discrete characters) # - Tip dates: assign fossil ages (in Ma) # Tip Dates tab → check "Use tip dates" # Set each fossil taxon's age (e.g., 66.0 Ma) # - Model for morphology: Mk model (Lewis 2001) # Lewis Mk with gamma rate variation # Set "ascertainmentBias" = conditioning on variable characters # - Clock: relaxed log-normal for molecular AND morphological # - Tree prior: Fossilized Birth-Death # Parameters: diversification rate, turnover, sampling proportion # - MCMC: 200,000,000 generations (tip dating needs longer chains) # Key FBD parameters: # diversificationRate = speciation − extinction # turnover = extinction / speciation # samplingProportion = fossil sampling rate # rho = extant sampling fraction (e.g., 0.1 if 10% of extant species sampled)
In the FBD model, fossils can be sampled ancestors (direct ancestors of living taxa, sitting on branches) rather than tips. The SA package in BEAST2 enables this. If a fossil has zero-length branch to its descendant, it's a sampled ancestor. This is biologically realistic — some fossils really are ancestral to living species.
+G) to relax this. Always condition on variable characters (ascertainmentBias) because invariant morphological characters are never coded.
# MCMCTree: fast Bayesian dating using approximate likelihood # Much faster than BEAST2 for large phylogenomic datasets # Step 1: Estimate branch lengths with BASEML (fixed topology) # mcmctree.ctl: seqfile = alignment.phy treefile = tree_with_calibrations.nwk outfile = out ndata = 5 * number of partitions usedata = 3 * 3 = calculate gradient/Hessian for approx likelihood clock = 2 * 2 = independent rates (UCLN-like), 3 = autocorrelated model = 4 * 4 = HKY85 BDparams = 1 1 0.1 * birth, death, sampling rgene_gamma = 2 20 * gamma prior on rates sigma2_gamma = 1 10 * gamma prior on rate variance mcmctree mcmctree.ctl # Step 2: Run MCMC with approximate likelihood # Change usedata = 2 (use the Hessian from step 1) # Re-run: mcmctree mcmctree.ctl # Calibration format in tree file: # (((A, B) '>0.66<1.30', C), D); # '>0.66' = minimum 66 Ma, '<1.30' = maximum 130 Ma (in 100 Ma units) # 'B(0.66, 1.30, 0.025, 0.025)' = soft bounds (recommended)
# treePL: fast penalized likelihood dating # Good for large trees (1000+ taxa) # Config file: treepl.cfg treefile = ml_tree.nwk smooth = 100 # smoothing parameter (cross-validate!) numsites = 50000 # total alignment length mrca = cal1 TaxonA TaxonB min = cal1 66 max = cal1 130 mrca = cal2 TaxonC TaxonD min = cal2 45 max = cal2 90 # Cross-validate smoothing parameter: thorough cvstart = 0.001 cvstop = 10000 cvmultstep = 10 # Run treePL treepl.cfg > treepl_dated.nwk # Alternative: RelTime in MEGA (very fast, approximate) # GUI: load alignment + tree, set calibrations # Good for exploratory analysis, not for final publication dates
| Tool | Method | Speed | Accuracy | Best For |
|---|---|---|---|---|
| BEAST2 | Full Bayesian MCMC | Very slow | Best (with good calibrations) | Final publication dates, <200 taxa |
| MCMCTree | Approx. likelihood Bayesian | Fast | Good | Phylogenomic datasets, 50–500 taxa |
| treePL | Penalized likelihood | Very fast | Moderate | Large trees (1000+ taxa) |
| RelTime | Relative rate framework | Instant | Approximate | Exploratory, quick estimates |
| r8s | PL / Langley-Fitch | Fast | Moderate | Legacy, small datasets |
Evolutionary rates vary dramatically across lineages due to differences in generation time, metabolic rate, effective population size, DNA repair efficiency, and life history traits.
library(ape)
library(ggtree)
library(treeio)
# Load BEAST2 MCC tree (has rate annotations)
beast_tree <- read.beast("dated_tree.nex")
# Plot with branch rates as color
ggtree(beast_tree, aes(color=rate)) +
scale_color_viridis_c(name="Substitution rate\n(subs/site/Myr)") +
geom_tiplab(size=2) +
theme_tree2() +
ggtitle("Rate variation across lineages")
# Plot with node age bars (95% HPD)
ggtree(beast_tree, mrsd="2024-01-01") +
geom_range("height_0.95_HPD", color="steelblue", alpha=0.3, size=2) +
geom_tiplab(size=2) +
theme_tree2() +
scale_x_continuous(labels=abs) +
xlab("Time (Ma)")
# Test for rate variation: coefficient of variation from BEAST2
# ucldStdev parameter in .log file
# Posterior median ucldStdev < 0.1 → strict clock sufficient
# ucldStdev > 0.5 → substantial rate variation
# ucldStdev > 1.0 → extreme rate variation
| Factor | Effect on Rate | Example |
|---|---|---|
| Generation time | Shorter generations → faster evolution (per year) | Mice evolve faster than elephants |
| Metabolic rate | Higher metabolism → more mutagenic byproducts | Birds vs. reptiles |
| Population size (Ne) | Small Ne → more drift, slightly different rate dynamics | Island vs. mainland populations |
| DNA repair | Better repair → slower rate | UV-exposed organisms may evolve faster |
| Life history | Annual vs. perennial plants differ systematically | Herbaceous plants faster than trees |
"Generation time Governs rate" — GG
The generation time hypothesis explains most rate variation in animals and plants. Short-lived organisms copy their DNA more often per unit time = more mutations per year. This is why mice have ~2× the molecular rate of humans.
Common issue with dating analyses. Solutions: (1) Increase chain length to 108 or 109, (2) Use BEAGLE for GPU acceleration, (3) Simplify: use fewer partitions or link clock models, (4) Try MCMCTree instead — its approximate likelihood converges faster.
Almost always a calibration problem. Check: (1) Are calibration offsets in the right units (Ma, not years)? (2) Is the fossil placed at the correct node? (3) Are prior distributions too wide? Run without data (sampleFromPrior="true") and check if priors alone produce reasonable ages.
The approximate likelihood failed. Causes: (1) Extremely short branch lengths (collapse near-zero branches), (2) Over-parameterized partition model. Try: reduce number of partitions, use simpler substitution model, or increase finetune values.
The data may lack clear clock-like signal. Try: (1) Use a wider range of smoothing values, (2) Check if the tree has very uneven branch lengths (rate variation too extreme for PL), (3) Switch to BEAST2/MCMCTree which handle rate variation better.
"Fossils give Floors, not Ceilings" — FFC
Fossils provide minimum ages (floors). The true divergence is always older (possibly much older) because the fossil record is incomplete. Use lognormal priors with the fossil age as the offset (floor), not the mean.
"BEAST for Best, MCMCTree for Many, treePL for Plenty"
BEAST2 = most accurate but slowest. MCMCTree = fast Bayesian for many loci. treePL = fastest for plenty of taxa. Choose based on dataset size and desired rigor.
sampleFromPrior="true" (no data) and compare to the posterior. If they look the same for a parameter, the data is not informing that parameter — your result is driven entirely by the prior.