Module 7 — PopGen, Phylogeography & Comparative Methods

Demographic Inference · Introgression Detection · Phylogeographic Modeling · Species Delimitation · Advanced Comparative Methods · Paleogenomics

~180 min Advanced Bash / R / Python
Byte

7.1 — Demographic Inference

ToolInputTime RangeStrengths
PSMC1 diploid genome~20 kya – 3 MyaNeeds only 1 individual; deep history
MSMC22–8 haplotypes~1 kya – 3 MyaCross-coalescence → split times
SMC++Hundreds of genomes~100 ya – 100 kyaRecent demography with many samples
Stairway Plot 2SFS from many samples~100 ya – 1 MyaNo phasing needed, SFS-based
fastsimcoal2Joint SFSAnyComplex multi-population models
Bash — PSMC & MSMC2
# PSMC: infer Ne(t) from a single diploid genome
# Step 1: Generate consensus diploid sequence
samtools mpileup -C50 -uf reference.fa sample.bam | \
  bcftools call -c | vcfutils.pl vcf2fq -d 10 -D 100 | \
  gzip > diploid.fq.gz

# Step 2: Create PSMC input
fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa

# Step 3: Run PSMC
psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o output.psmc diploid.psmcfa

# Step 4: Bootstrap (100 replicates for CIs)
splitfa diploid.psmcfa > split.psmcfa
seq 100 | xargs -P 16 -I{} psmc -N25 -t15 -r5 \
  -p "4+25*2+4+6" -b -o round{}.psmc split.psmcfa

# Step 5: Plot
psmc_plot.pl -u 1.4e-08 -g 25 -p output output.psmc
# -u: mutation rate per site per generation
# -g: generation time in years
# Output: Ne(t) curve from ~20 kya to ~3 Mya

# MSMC2: cross-coalescence for population split times
msmc2 --fixedRecombination -t 16 \
  -o pop1_pop2 \
  pop1_hap1.txt pop1_hap2.txt pop2_hap1.txt pop2_hap2.txt
# Output: relative cross-coalescence rate (rCCR)
# rCCR = 1 → single population; rCCR = 0 → fully separated
# The time when rCCR drops from 1 to 0 = population split time
Download:
Bash — fastsimcoal2 Complex Demographics
# fastsimcoal2: fit complex demographic models to joint SFS
# Step 1: Compute joint SFS from VCF
# Use easySFS or ANGSD to generate observed SFS

# Step 2: Define demographic model (.tpl + .est files)
# Example: 2 populations with divergence + migration + bottleneck
# model.tpl:
# //Parameters for the coalescent simulation program:
# 2 samples to simulate :
# //Population effective sizes (number of genes)
# NPOP1
# NPOP2
# //Samples sizes
# 20
# 20
# //Growth rates
# 0
# 0
# //Number of migration matrices
# 2
# //Migration matrix 0
# 0 MIG12
# MIG21 0
# //Migration matrix 1
# 0 0
# 0 0
# //historical events: time, src, sink, migrants, resize, growth, migmat
# 1 historical events
# TDIV 1 0 1 RESIZE 0 1

# Step 3: Run optimization
fsc27 -t model.tpl -e model.est -n 100000 -m -M \
  -L 40 -c 16 -B 16 --multiSFS
# -L 40: 40 optimization cycles
# -M: compute SFS from maximum likelihood parameters

# Step 4: Bootstrap for confidence intervals
# Run 50-100 non-parametric bootstraps from simulated SFS

# Step 5: Model comparison with AIC
# Compare nested models (with/without migration, bottleneck, etc.)
Download:
Thumb Rule — Scaling PSMC/MSMC: All results depend on two numbers you provide: mutation rate (μ) and generation time (g). If either is wrong, all dates shift proportionally. Always report the μ and g used, and show how results change with alternative values.

7.2 — Admixture & Introgression Detection

Bash — D-statistics & Dsuite
# D-statistics (ABBA-BABA) framework
# Tests: is there excess allele sharing between P2 and P3?
# Topology: ((P1,P2),P3),O)
# D = 0: no introgression (ILS only)
# D > 0: P2-P3 introgression (excess ABBA)
# D < 0: P1-P3 introgression (excess BABA)

# Dsuite: efficient D-statistics for all population trios
Dsuite Dtrios -t species_tree.nwk input.vcf.gz populations.txt
# Output: BBAA, ABBA, BABA counts + D-statistic + p-value for all trios

# f-branch statistic: assign introgression to specific branches
Dsuite Fbranch species_tree.nwk Dtrios_output.txt > fbranch.txt
# Visualize with:
python dtools.py fbranch.txt species_tree.nwk

# ADMIXTOOLS2 in R: formal tests
library(admixtools)
f2_blocks <- f2_from_geno("prefix")  # EIGENSTRAT format input

# f3 test: f3(A; B, C) < 0 means A is admixed from B and C
f3_result <- qp3pop(f2_blocks, source1="Pop_B", source2="Pop_C",
                     target="Pop_A")

# f4 test: Patterson's D equivalent
f4_result <- qpDstat(f2_blocks, pop1="P1", pop2="P2",
                      pop3="P3", pop4="O")

# qpAdm: model target as mixture of source populations
qpadm_result <- qpadm(f2_blocks, target="Admixed_Pop",
                        sources=c("Source1","Source2"),
                        outgroups=c("Outgroup1","Outgroup2","Outgroup3"))

# qpGraph: fit an admixture graph
# Requires specifying a topology and testing fit
Download:
Mnemonic

"ABBA = Ancestral-Back-Back-Ancestral"

In the pattern ABBA, P1 has ancestral allele, P2 has derived, P3 has derived, outgroup has ancestral. Excess ABBA over BABA means P2 and P3 share more alleles than expected under ILS alone → introgression between P2 and P3.

7.3 — Phylogeographic Modeling

Workflow — BEAST2 Discrete Phylogeography
# Discrete trait phylogeography in BEAST2:
# 1. BEAUti: load alignment
# 2. Tip Dates: if heterochronous (e.g., viral sequences)
# 3. Partitions tab: add a "discrete trait" partition
#    - Column = geographic location for each sample
# 4. Clock model for trait: strict clock
# 5. Substitution model for trait: symmetric or asymmetric
#    - BSSVS (Bayesian Stochastic Search Variable Selection)
#    - Identifies significant migration routes
# 6. Tree prior: coalescent or birth-death
# 7. MCMC: 100,000,000+ generations

# After running:
# - SpreaD3: visualize geographic spread on a map
#   java -jar SpreaD3.jar
#   Input: MCC tree with location annotations
#   Output: animated KML file for Google Earth
#   or D3.js visualization

# Continuous phylogeography:
# Instead of discrete locations, use latitude/longitude coordinates
# BEAUti: set tip locations as continuous trait (x,y coordinates)
# Diffusion model: random walk, Cauchy, or relaxed random walk
# Visualize with SpreaD3 continuous mode

Structured Coalescent (BASTA / Mascot)

Bash
# Mascot (BEAST2 package): structured coalescent
# More appropriate than discrete trait models when population structure
# is strong and gene flow is ongoing (not just historical events)

# Install: BEAST2 Package Manager → MASCOT
# BEAUti setup:
# - Tree prior: Structured Coalescent (Mascot)
# - Assign each sample to a population/deme
# - Parameters: Ne per deme, migration rates between demes
# - MCMC: very long chains (structured coalescent is slow)

# Advantage over discrete trait: properly models the coalescent
# process within structured populations
# Disadvantage: limited to ~10 demes, computationally expensive

7.4 — Species Delimitation

Bash — BPP & mPTP
# BPP: Bayesian Phylogenetics and Phylogeography
# Joint species delimitation and species tree estimation
# Uses multispecies coalescent — most rigorous method

# bpp.ctl:
  seed = -1
  seqfile = alignment.phy
  Imapfile = imap.txt      # maps individuals → species
  outfile = bpp_output.txt
  speciesdelimitation = 1   # 1 = delimitation mode
  speciestree = 1           # 1 = also estimate species tree
  species&tree = 4  A B C D  # initial species assignment
                    3 3 2 2   # samples per species
                    ((A,B),(C,D));  # guide tree
  usedata = 1
  nloci = 100
  cleandata = 0
  thetaprior = gamma(2, 100)
  tauprior = gamma(2, 200)
  finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
  burnin = 50000
  sampfreq = 5
  nsample = 100000

bpp --cfile bpp.ctl

# mPTP: multi-rate Poisson Tree Processes
# Single-locus, distance-based — fast but less rigorous
mptp --tree_file ml_tree.nwk --output_file delimit_result \
  --ml --multi --minbr_auto
# Output: delimitation of species based on branch length distribution
# Branches within species are shorter (coalescent) than between species

# bGMYC: Bayesian GMYC (R package)
# library(bGMYC)
# result <- bgmyc.multiphylo(posterior_trees, mcmc=50000)
# Similar to mPTP but uses Bayesian framework on a tree posterior
Download:
Integrative Species Delimitation
  • BPP = gold standard for multilocus data (most rigorous, uses MSC)
  • mPTP/GMYC = quick single-locus methods (good for barcoding data)
  • DELINEATE = newer method for species delimitation with missing data
  • Always combine molecular results with morphological, ecological, and geographic evidence (integrative taxonomy)
  • Beware over-splitting: all methods tend to over-split when population structure is strong within species

8.1 — Phylogenetic Generalized Linear Mixed Models

R — MCMCglmm
library(MCMCglmm)
library(ape)

tree <- read.tree("dated_tree.nwk")
# Create inverse phylogenetic covariance matrix
inv.phylo <- inverseA(tree, nodes="ALL", scale=TRUE)

data <- read.csv("species_data.csv")
# Columns: animal (= species name matching tree), trait1, trait2, binary_trait

# Gaussian response with phylogenetic random effect
prior1 <- list(R=list(V=1, nu=0.002),
               G=list(G1=list(V=1, nu=0.002)))
model1 <- MCMCglmm(trait1 ~ trait2,
                     random = ~animal,
                     family = "gaussian",
                     ginverse = list(animal=inv.phylo$Ainv),
                     data = data,
                     prior = prior1,
                     nitt = 1300000, burnin = 300000, thin = 1000)
summary(model1)

# Binary response (e.g., presence/absence)
prior2 <- list(R=list(V=1, fix=1),
               G=list(G1=list(V=1, nu=0.002)))
model2 <- MCMCglmm(binary_trait ~ trait1,
                     random = ~animal,
                     family = "categorical",
                     ginverse = list(animal=inv.phylo$Ainv),
                     data = data,
                     prior = prior2,
                     nitt = 2600000, burnin = 600000, thin = 2000)

# Phylogenetic signal (heritability):
# h2 = variance_phylo / (variance_phylo + variance_residual)
h2 <- model1$VCV[,"animal"] /
      (model1$VCV[,"animal"] + model1$VCV[,"units"])
posterior.mode(h2)
HPDinterval(h2)
Download:

Phylogenetic Path Analysis

R — phylopath
# phylopath: test causal hypotheses among correlated traits
library(phylopath)

tree <- read.tree("dated_tree.nwk")
data <- read.csv("species_traits.csv", row.names=1)

# Define competing causal models as DAGs
models <- define_model_set(
  # Model 1: body_size → brain_size → longevity
  m1 = c(brain_size ~ body_size, longevity ~ brain_size),
  # Model 2: body_size → both brain_size and longevity independently
  m2 = c(brain_size ~ body_size, longevity ~ body_size),
  # Model 3: body_size → brain_size → longevity + direct effect
  m3 = c(brain_size ~ body_size, longevity ~ brain_size + body_size)
)

# Fit all models with phylogenetic correction
result <- phylo_path(models, data=data, tree=tree, model="lambda")

# Compare with CICc (C-statistic information criterion)
s <- summary(result)
print(s)  # ranked by CICc — lowest is best

# Average across best models
avg_model <- average(result, cut_off=2)  # average models within delta CICc < 2
plot(avg_model)

8.4 — Co-phylogenetics & Host-Parasite Coevolution

R
# PACo: Procrustean Approach to Cophylogeny
library(paco)

host_tree <- read.tree("host_tree.nwk")
parasite_tree <- read.tree("parasite_tree.nwk")
# Association matrix: which parasite on which host
assoc <- read.csv("associations.csv")  # host, parasite columns

D <- prepare_paco_data(host_tree, parasite_tree, assoc)
D <- add_pcoords(D)
result <- PACo(D, nperm=10000, seed=42)
print(result)
# Significant result → congruence between host and parasite phylogenies
# Residuals per link: large residual = host switch event

# ParaFit: global test of cospeciation
library(ape)
parafit_result <- parafit(
  host_D=cophenetic(host_tree),
  para_D=cophenetic(parasite_tree),
  HP=assoc_matrix,
  nperm=999
)
# Significant → host-parasite cospeciation signal

# eMPRess: event-based reconciliation (GUI/command line)
# Reconstructs: cospeciation, duplication, host switch, loss
# empress --host host.nwk --parasite parasite.nwk --mapping assoc.txt
Download:

9.1 — Ancient DNA (aDNA) Processing

Bash — aDNA Pipeline
# Ancient DNA has specific challenges:
# 1. Short fragments (30-80 bp typical)
# 2. C→T damage at 5' ends, G→A at 3' ends (deamination)
# 3. Low endogenous content (1-10% in most samples)
# 4. Contamination from modern DNA

# Step 1: Adapter trimming (short overlapping fragments!)
AdapterRemoval --file1 R1.fq.gz --file2 R2.fq.gz \
  --collapse --trimns --trimqualities --minlength 30 \
  --output1 trimmed_R1.fq.gz --output2 trimmed_R2.fq.gz \
  --outputcollapsed collapsed.fq.gz

# Step 2: Align with BWA-ALN (NOT BWA-MEM for short aDNA reads)
bwa aln -l 1024 -n 0.01 -o 2 reference.fa collapsed.fq.gz > aln.sai
bwa samse reference.fa aln.sai collapsed.fq.gz | \
  samtools sort -o aligned.bam
samtools index aligned.bam

# Step 3: Remove duplicates
picard MarkDuplicates I=aligned.bam O=dedup.bam M=metrics.txt \
  REMOVE_DUPLICATES=true

# Step 4: Authenticate — check damage patterns
mapDamage2 -i dedup.bam -r reference.fa -d mapDamage_results/
# Expect: elevated C→T at position 1 of reads (5' end)
#         elevated G→A at last position (3' end)
# If no damage pattern → likely modern contamination

# Step 5: Rescale base qualities (account for damage)
mapDamage2 -i dedup.bam -r reference.fa --rescale \
  -d mapDamage_rescaled/

# Step 6: PMDtools — filter for authentically ancient reads
samtools view -h dedup.bam | \
  pmdtools --threshold 3 --header | \
  samtools view -Sb - > pmd_filtered.bam

# Step 7: Contamination estimation
# For mitochondrial data:
schmutzi --notab ref_mt.fa dedup.bam
# For nuclear: ANGSD contamination estimate (for males, X chromosome)
angsd -i dedup.bam -r X: -doCounts 1 -iCounts 1 \
  -minMapQ 30 -minQ 20

# Step 8: Genotype likelihoods (for low-coverage aDNA)
# Use ANGSD — never hard-call genotypes from low-coverage aDNA
angsd -i pmd_filtered.bam -GL 2 -doGlf 2 \
  -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 \
  -minMapQ 30 -minQ 20 -out ancient_GL
Download:

Phylogenetic Placement of Ancient Samples

Bash
# EPA-ng: place short fragments into a reference tree
# Build reference tree and alignment first (modern samples only)
epa-ng --tree reference_tree.nwk --ref-msa reference_alignment.fasta \
  --query ancient_sequences.fasta --redo

# pplacer: alternative placement tool
pplacer -t reference_tree.nwk -r reference_alignment.fasta \
  -s reference.stats ancient_sequences.fasta

# Output: .jplace file with placement probabilities
# Visualize with gappa or iTOL
Mnemonic

"C to T at the Tip = anCienT"

The hallmark of authentic ancient DNA is C→T substitutions at the 5' end of reads (cytosine deamination). If mapDamage doesn't show this pattern, your DNA is likely modern contamination. No damage = no trust.

AI Prompt Guide

Population Genomics & Phylogeography
I have [WGS / RADseq / targeted capture] data from [NUMBER] individuals across [NUMBER] populations of [SPECIES]. Reference genome: [available / de novo assembly needed]. Coverage: [HIGH >20x / LOW 1-5x / ANCIENT <1x]. I want to: 1. Infer demographic history (Ne over time) with PSMC/MSMC2 2. Test for introgression between [POP_A] and [POP_B] using D-statistics 3. Fit a demographic model with fastsimcoal2 (divergence + migration + bottleneck) 4. Perform species delimitation with BPP 5. Run phylogeographic analysis in BEAST2 (discrete trait) 6. [If aDNA]: process ancient samples with mapDamage authentication Please provide complete Bash/R scripts for each analysis including: - Proper input file preparation from VCF - SLURM scripts for HPC - Interpretation guidelines for key statistics - Visualization code for each result

Troubleshooting

PSMC: Flat or unrealistic Ne curve

Causes: (1) Too few heterozygous sites (low diversity species or too stringent filtering), (2) Excessive sequencing error masking true heterozygosity, (3) Wrong mutation rate or generation time. Check: minimum 20x coverage for PSMC, proper quality filtering, and try different -p (time interval) patterns.

D-statistics: Significant D but unclear biological interpretation

D-statistics detect excess allele sharing but can't distinguish introgression direction or timing. Use: (1) f-branch to localize introgression on the phylogeny, (2) Local ancestry inference (RFMix) to identify introgression tracts, (3) Multiple outgroups to verify outgroup is not the source of signal.

BPP: Over-splits populations into too many "species"

BPP tends to delimit population structure as species boundaries when gene flow is low. This is a known limitation. Solutions: (1) Use BPP with migration (A11 analysis), (2) Cross-validate with morphological and ecological data, (3) Increase prior on theta (allows more within-species variation), (4) Consider that "species" in BPP means "independently evolving lineage", not necessarily reproductively isolated species.

mapDamage: No damage pattern detected in ancient sample

Possible causes: (1) Sample is actually modern contamination — discard, (2) Library was UDG-treated (removes damaged bases — this is intentional for SNP calling), (3) Damage was already trimmed during adapter removal. Check library preparation protocol before concluding contamination.

Mnemonics & Thumb Rules

Mnemonic

"PSMC for Past, SMC++ for preSent"

PSMC resolves deep history (20 kya – 3 Mya) from a single genome. SMC++ resolves recent history (100 – 100,000 ya) from many genomes. Use both together for a complete demographic picture.

Mnemonic

"D > 0 = Derived sharing between P2 & P3"

Positive D-statistic means P2 and P3 share more derived alleles than expected. This is the ABBA-BABA test. Always remember: D detects the signal but doesn't tell you the direction of gene flow — P3 could have introgressed into P2 OR vice versa.

Thumb Rule — aDNA Coverage: For population genetics: ≥1x for genotype likelihoods (ANGSD). For phylogenetic placement: 0.1x can work with EPA-ng. For PSMC: need ≥20x (almost never achievable for ancient samples). For variant calling: never hard-call genotypes below 10x.
Thumb Rule — Species Delimitation: Use at least 3 independent lines of evidence: (1) Molecular (BPP/mPTP), (2) Morphological, (3) Ecological or geographic. Molecular methods alone tend to over-split, especially in structured populations with limited gene flow.