Demographic Inference · Introgression Detection · Phylogeographic Modeling · Species Delimitation · Advanced Comparative Methods · Paleogenomics
| Tool | Input | Time Range | Strengths |
|---|---|---|---|
| PSMC | 1 diploid genome | ~20 kya – 3 Mya | Needs only 1 individual; deep history |
| MSMC2 | 2–8 haplotypes | ~1 kya – 3 Mya | Cross-coalescence → split times |
| SMC++ | Hundreds of genomes | ~100 ya – 100 kya | Recent demography with many samples |
| Stairway Plot 2 | SFS from many samples | ~100 ya – 1 Mya | No phasing needed, SFS-based |
| fastsimcoal2 | Joint SFS | Any | Complex multi-population models |
# PSMC: infer Ne(t) from a single diploid genome
# Step 1: Generate consensus diploid sequence
samtools mpileup -C50 -uf reference.fa sample.bam | \
bcftools call -c | vcfutils.pl vcf2fq -d 10 -D 100 | \
gzip > diploid.fq.gz
# Step 2: Create PSMC input
fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa
# Step 3: Run PSMC
psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o output.psmc diploid.psmcfa
# Step 4: Bootstrap (100 replicates for CIs)
splitfa diploid.psmcfa > split.psmcfa
seq 100 | xargs -P 16 -I{} psmc -N25 -t15 -r5 \
-p "4+25*2+4+6" -b -o round{}.psmc split.psmcfa
# Step 5: Plot
psmc_plot.pl -u 1.4e-08 -g 25 -p output output.psmc
# -u: mutation rate per site per generation
# -g: generation time in years
# Output: Ne(t) curve from ~20 kya to ~3 Mya
# MSMC2: cross-coalescence for population split times
msmc2 --fixedRecombination -t 16 \
-o pop1_pop2 \
pop1_hap1.txt pop1_hap2.txt pop2_hap1.txt pop2_hap2.txt
# Output: relative cross-coalescence rate (rCCR)
# rCCR = 1 → single population; rCCR = 0 → fully separated
# The time when rCCR drops from 1 to 0 = population split time
# fastsimcoal2: fit complex demographic models to joint SFS # Step 1: Compute joint SFS from VCF # Use easySFS or ANGSD to generate observed SFS # Step 2: Define demographic model (.tpl + .est files) # Example: 2 populations with divergence + migration + bottleneck # model.tpl: # //Parameters for the coalescent simulation program: # 2 samples to simulate : # //Population effective sizes (number of genes) # NPOP1 # NPOP2 # //Samples sizes # 20 # 20 # //Growth rates # 0 # 0 # //Number of migration matrices # 2 # //Migration matrix 0 # 0 MIG12 # MIG21 0 # //Migration matrix 1 # 0 0 # 0 0 # //historical events: time, src, sink, migrants, resize, growth, migmat # 1 historical events # TDIV 1 0 1 RESIZE 0 1 # Step 3: Run optimization fsc27 -t model.tpl -e model.est -n 100000 -m -M \ -L 40 -c 16 -B 16 --multiSFS # -L 40: 40 optimization cycles # -M: compute SFS from maximum likelihood parameters # Step 4: Bootstrap for confidence intervals # Run 50-100 non-parametric bootstraps from simulated SFS # Step 5: Model comparison with AIC # Compare nested models (with/without migration, bottleneck, etc.)
# D-statistics (ABBA-BABA) framework
# Tests: is there excess allele sharing between P2 and P3?
# Topology: ((P1,P2),P3),O)
# D = 0: no introgression (ILS only)
# D > 0: P2-P3 introgression (excess ABBA)
# D < 0: P1-P3 introgression (excess BABA)
# Dsuite: efficient D-statistics for all population trios
Dsuite Dtrios -t species_tree.nwk input.vcf.gz populations.txt
# Output: BBAA, ABBA, BABA counts + D-statistic + p-value for all trios
# f-branch statistic: assign introgression to specific branches
Dsuite Fbranch species_tree.nwk Dtrios_output.txt > fbranch.txt
# Visualize with:
python dtools.py fbranch.txt species_tree.nwk
# ADMIXTOOLS2 in R: formal tests
library(admixtools)
f2_blocks <- f2_from_geno("prefix") # EIGENSTRAT format input
# f3 test: f3(A; B, C) < 0 means A is admixed from B and C
f3_result <- qp3pop(f2_blocks, source1="Pop_B", source2="Pop_C",
target="Pop_A")
# f4 test: Patterson's D equivalent
f4_result <- qpDstat(f2_blocks, pop1="P1", pop2="P2",
pop3="P3", pop4="O")
# qpAdm: model target as mixture of source populations
qpadm_result <- qpadm(f2_blocks, target="Admixed_Pop",
sources=c("Source1","Source2"),
outgroups=c("Outgroup1","Outgroup2","Outgroup3"))
# qpGraph: fit an admixture graph
# Requires specifying a topology and testing fit
"ABBA = Ancestral-Back-Back-Ancestral"
In the pattern ABBA, P1 has ancestral allele, P2 has derived, P3 has derived, outgroup has ancestral. Excess ABBA over BABA means P2 and P3 share more alleles than expected under ILS alone → introgression between P2 and P3.
# Discrete trait phylogeography in BEAST2: # 1. BEAUti: load alignment # 2. Tip Dates: if heterochronous (e.g., viral sequences) # 3. Partitions tab: add a "discrete trait" partition # - Column = geographic location for each sample # 4. Clock model for trait: strict clock # 5. Substitution model for trait: symmetric or asymmetric # - BSSVS (Bayesian Stochastic Search Variable Selection) # - Identifies significant migration routes # 6. Tree prior: coalescent or birth-death # 7. MCMC: 100,000,000+ generations # After running: # - SpreaD3: visualize geographic spread on a map # java -jar SpreaD3.jar # Input: MCC tree with location annotations # Output: animated KML file for Google Earth # or D3.js visualization # Continuous phylogeography: # Instead of discrete locations, use latitude/longitude coordinates # BEAUti: set tip locations as continuous trait (x,y coordinates) # Diffusion model: random walk, Cauchy, or relaxed random walk # Visualize with SpreaD3 continuous mode
# Mascot (BEAST2 package): structured coalescent # More appropriate than discrete trait models when population structure # is strong and gene flow is ongoing (not just historical events) # Install: BEAST2 Package Manager → MASCOT # BEAUti setup: # - Tree prior: Structured Coalescent (Mascot) # - Assign each sample to a population/deme # - Parameters: Ne per deme, migration rates between demes # - MCMC: very long chains (structured coalescent is slow) # Advantage over discrete trait: properly models the coalescent # process within structured populations # Disadvantage: limited to ~10 demes, computationally expensive
# BPP: Bayesian Phylogenetics and Phylogeography
# Joint species delimitation and species tree estimation
# Uses multispecies coalescent — most rigorous method
# bpp.ctl:
seed = -1
seqfile = alignment.phy
Imapfile = imap.txt # maps individuals → species
outfile = bpp_output.txt
speciesdelimitation = 1 # 1 = delimitation mode
speciestree = 1 # 1 = also estimate species tree
species&tree = 4 A B C D # initial species assignment
3 3 2 2 # samples per species
((A,B),(C,D)); # guide tree
usedata = 1
nloci = 100
cleandata = 0
thetaprior = gamma(2, 100)
tauprior = gamma(2, 200)
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
burnin = 50000
sampfreq = 5
nsample = 100000
bpp --cfile bpp.ctl
# mPTP: multi-rate Poisson Tree Processes
# Single-locus, distance-based — fast but less rigorous
mptp --tree_file ml_tree.nwk --output_file delimit_result \
--ml --multi --minbr_auto
# Output: delimitation of species based on branch length distribution
# Branches within species are shorter (coalescent) than between species
# bGMYC: Bayesian GMYC (R package)
# library(bGMYC)
# result <- bgmyc.multiphylo(posterior_trees, mcmc=50000)
# Similar to mPTP but uses Bayesian framework on a tree posterior
library(MCMCglmm)
library(ape)
tree <- read.tree("dated_tree.nwk")
# Create inverse phylogenetic covariance matrix
inv.phylo <- inverseA(tree, nodes="ALL", scale=TRUE)
data <- read.csv("species_data.csv")
# Columns: animal (= species name matching tree), trait1, trait2, binary_trait
# Gaussian response with phylogenetic random effect
prior1 <- list(R=list(V=1, nu=0.002),
G=list(G1=list(V=1, nu=0.002)))
model1 <- MCMCglmm(trait1 ~ trait2,
random = ~animal,
family = "gaussian",
ginverse = list(animal=inv.phylo$Ainv),
data = data,
prior = prior1,
nitt = 1300000, burnin = 300000, thin = 1000)
summary(model1)
# Binary response (e.g., presence/absence)
prior2 <- list(R=list(V=1, fix=1),
G=list(G1=list(V=1, nu=0.002)))
model2 <- MCMCglmm(binary_trait ~ trait1,
random = ~animal,
family = "categorical",
ginverse = list(animal=inv.phylo$Ainv),
data = data,
prior = prior2,
nitt = 2600000, burnin = 600000, thin = 2000)
# Phylogenetic signal (heritability):
# h2 = variance_phylo / (variance_phylo + variance_residual)
h2 <- model1$VCV[,"animal"] /
(model1$VCV[,"animal"] + model1$VCV[,"units"])
posterior.mode(h2)
HPDinterval(h2)
# phylopath: test causal hypotheses among correlated traits
library(phylopath)
tree <- read.tree("dated_tree.nwk")
data <- read.csv("species_traits.csv", row.names=1)
# Define competing causal models as DAGs
models <- define_model_set(
# Model 1: body_size → brain_size → longevity
m1 = c(brain_size ~ body_size, longevity ~ brain_size),
# Model 2: body_size → both brain_size and longevity independently
m2 = c(brain_size ~ body_size, longevity ~ body_size),
# Model 3: body_size → brain_size → longevity + direct effect
m3 = c(brain_size ~ body_size, longevity ~ brain_size + body_size)
)
# Fit all models with phylogenetic correction
result <- phylo_path(models, data=data, tree=tree, model="lambda")
# Compare with CICc (C-statistic information criterion)
s <- summary(result)
print(s) # ranked by CICc — lowest is best
# Average across best models
avg_model <- average(result, cut_off=2) # average models within delta CICc < 2
plot(avg_model)
# PACo: Procrustean Approach to Cophylogeny
library(paco)
host_tree <- read.tree("host_tree.nwk")
parasite_tree <- read.tree("parasite_tree.nwk")
# Association matrix: which parasite on which host
assoc <- read.csv("associations.csv") # host, parasite columns
D <- prepare_paco_data(host_tree, parasite_tree, assoc)
D <- add_pcoords(D)
result <- PACo(D, nperm=10000, seed=42)
print(result)
# Significant result → congruence between host and parasite phylogenies
# Residuals per link: large residual = host switch event
# ParaFit: global test of cospeciation
library(ape)
parafit_result <- parafit(
host_D=cophenetic(host_tree),
para_D=cophenetic(parasite_tree),
HP=assoc_matrix,
nperm=999
)
# Significant → host-parasite cospeciation signal
# eMPRess: event-based reconciliation (GUI/command line)
# Reconstructs: cospeciation, duplication, host switch, loss
# empress --host host.nwk --parasite parasite.nwk --mapping assoc.txt
# Ancient DNA has specific challenges: # 1. Short fragments (30-80 bp typical) # 2. C→T damage at 5' ends, G→A at 3' ends (deamination) # 3. Low endogenous content (1-10% in most samples) # 4. Contamination from modern DNA # Step 1: Adapter trimming (short overlapping fragments!) AdapterRemoval --file1 R1.fq.gz --file2 R2.fq.gz \ --collapse --trimns --trimqualities --minlength 30 \ --output1 trimmed_R1.fq.gz --output2 trimmed_R2.fq.gz \ --outputcollapsed collapsed.fq.gz # Step 2: Align with BWA-ALN (NOT BWA-MEM for short aDNA reads) bwa aln -l 1024 -n 0.01 -o 2 reference.fa collapsed.fq.gz > aln.sai bwa samse reference.fa aln.sai collapsed.fq.gz | \ samtools sort -o aligned.bam samtools index aligned.bam # Step 3: Remove duplicates picard MarkDuplicates I=aligned.bam O=dedup.bam M=metrics.txt \ REMOVE_DUPLICATES=true # Step 4: Authenticate — check damage patterns mapDamage2 -i dedup.bam -r reference.fa -d mapDamage_results/ # Expect: elevated C→T at position 1 of reads (5' end) # elevated G→A at last position (3' end) # If no damage pattern → likely modern contamination # Step 5: Rescale base qualities (account for damage) mapDamage2 -i dedup.bam -r reference.fa --rescale \ -d mapDamage_rescaled/ # Step 6: PMDtools — filter for authentically ancient reads samtools view -h dedup.bam | \ pmdtools --threshold 3 --header | \ samtools view -Sb - > pmd_filtered.bam # Step 7: Contamination estimation # For mitochondrial data: schmutzi --notab ref_mt.fa dedup.bam # For nuclear: ANGSD contamination estimate (for males, X chromosome) angsd -i dedup.bam -r X: -doCounts 1 -iCounts 1 \ -minMapQ 30 -minQ 20 # Step 8: Genotype likelihoods (for low-coverage aDNA) # Use ANGSD — never hard-call genotypes from low-coverage aDNA angsd -i pmd_filtered.bam -GL 2 -doGlf 2 \ -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 \ -minMapQ 30 -minQ 20 -out ancient_GL
# EPA-ng: place short fragments into a reference tree # Build reference tree and alignment first (modern samples only) epa-ng --tree reference_tree.nwk --ref-msa reference_alignment.fasta \ --query ancient_sequences.fasta --redo # pplacer: alternative placement tool pplacer -t reference_tree.nwk -r reference_alignment.fasta \ -s reference.stats ancient_sequences.fasta # Output: .jplace file with placement probabilities # Visualize with gappa or iTOL
"C to T at the Tip = anCienT"
The hallmark of authentic ancient DNA is C→T substitutions at the 5' end of reads (cytosine deamination). If mapDamage doesn't show this pattern, your DNA is likely modern contamination. No damage = no trust.
Causes: (1) Too few heterozygous sites (low diversity species or too stringent filtering), (2) Excessive sequencing error masking true heterozygosity, (3) Wrong mutation rate or generation time. Check: minimum 20x coverage for PSMC, proper quality filtering, and try different -p (time interval) patterns.
D-statistics detect excess allele sharing but can't distinguish introgression direction or timing. Use: (1) f-branch to localize introgression on the phylogeny, (2) Local ancestry inference (RFMix) to identify introgression tracts, (3) Multiple outgroups to verify outgroup is not the source of signal.
BPP tends to delimit population structure as species boundaries when gene flow is low. This is a known limitation. Solutions: (1) Use BPP with migration (A11 analysis), (2) Cross-validate with morphological and ecological data, (3) Increase prior on theta (allows more within-species variation), (4) Consider that "species" in BPP means "independently evolving lineage", not necessarily reproductively isolated species.
Possible causes: (1) Sample is actually modern contamination — discard, (2) Library was UDG-treated (removes damaged bases — this is intentional for SNP calling), (3) Damage was already trimmed during adapter removal. Check library preparation protocol before concluding contamination.
"PSMC for Past, SMC++ for preSent"
PSMC resolves deep history (20 kya – 3 Mya) from a single genome. SMC++ resolves recent history (100 – 100,000 ya) from many genomes. Use both together for a complete demographic picture.
"D > 0 = Derived sharing between P2 & P3"
Positive D-statistic means P2 and P3 share more derived alleles than expected. This is the ABBA-BABA test. Always remember: D detects the signal but doesn't tell you the direction of gene flow — P3 could have introgressed into P2 OR vice versa.