Demographic History: PSMC & SMC++

Reconstruct Ne trajectories from genomes · PSMC pairwise coalescent · SMC++ multi-sample inference · Confidence intervals via bootstrapping

~120 min Advanced Bash / Python / R
Byte

Overview

The Pairwise Sequentially Markovian Coalescent (PSMC; Li & Durbin 2011) infers the history of effective population size (Ne) from a single diploid genome by analysing the distribution of heterozygous sites along chromosomes. Regions of the genome with many heterozygous sites coalesced recently (large Ne), while long runs of homozygosity coalesced long ago (small Ne). PSMC can reconstruct Ne up to ~5 Mya but struggles with very recent events (<20,000 years). SMC++ (Terhorst et al. 2017) uses multiple individuals and can resolve more recent history.

Byte
Key parameters needed:
  • Mutation rate (μ): per-generation per-base mutation rate for your species. Critical for converting coalescent time to years.
  • Generation time (g): average generation time in years. Also critical — wrong values shift the entire Ne curve in time.
  • Coverage: PSMC requires ≥15× mean depth. Below this, heterozygous sites are missed, leading to artificially low Ne estimates.

Ideal Directory Structure

psmc_snowleopard/
psmc_snowleopard/ ├── 0_reads/ # raw FASTQ downloads ├── 1_alignment/ # BWA-MEM BAM + index │ ├── snowleopard.sorted.bam │ └── snowleopard.sorted.bam.bai ├── 2_consensus/ # consensus diploid FASTA (for PSMC) │ └── snowleopard_consensus.fa.gz ├── 3_psmc/ # PSMC input and output │ ├── snowleopard.psmcfa # encoded input file │ ├── snowleopard.psmc # PSMC output │ └── bootstrap/ # 100 bootstrap runs ├── 4_smcpp/ # SMC++ analysis │ ├── snowleopard.smc.gz │ └── model.final.json └── 5_plots/ ├── psmc_plot.png └── smcpp_plot.png

Dataset

The snow leopard (Panthera uncia) whole-genome dataset represents an endangered large felid with a complex demographic history shaped by Pleistocene glacial cycles in Central Asia. Its manageable genome size (~2.4 Gb) and high-quality reference assembly make it a tractable PSMC tutorial dataset.

PropertyValue
OrganismPanthera uncia (snow leopard)
AccessionPRJNA304161 (SRR2961478 — 30× WGS individual)
Reference genomeGCF_018350335.1 (NCBI RefSeq)
Mutation rate (μ)~2.8 × 10−9 per site per generation (felid rate)
Generation time~4 years
PublicationCho et al. 2013, Genome Biology

Step 1 — Align Reads to Reference

Bash — Download reads and align with BWA-MEM
mkdir -p 0_reads 1_alignment
module load sra-toolkit bwa samtools/1.17

# Download snow leopard WGS reads (~15 GB; use fasterq-dump for speed)
fasterq-dump --split-files --outdir 0_reads SRR2961478
gzip 0_reads/SRR2961478_*.fastq

# Download snow leopard reference genome
mkdir -p ref
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/350/335/GCF_018350335.1_PunCin_v1.0/\
GCF_018350335.1_PunCin_v1.0_genomic.fna.gz -P ref/
gunzip ref/GCF_018350335.1_PunCin_v1.0_genomic.fna.gz
bwa index ref/GCF_018350335.1_PunCin_v1.0_genomic.fna

# Align with BWA-MEM
bwa mem \
    -t 16 \
    -R "@RG\tID:SRR2961478\tSM:snowleopard\tPL:ILLUMINA\tLB:lib1" \
    ref/GCF_018350335.1_PunCin_v1.0_genomic.fna \
    0_reads/SRR2961478_1.fastq.gz \
    0_reads/SRR2961478_2.fastq.gz | \
  samtools sort -@ 8 -o 1_alignment/snowleopard.sorted.bam

samtools index 1_alignment/snowleopard.sorted.bam

# Check mean depth (need >= 15x for PSMC)
samtools depth -a 1_alignment/snowleopard.sorted.bam | \
    awk '{sum+=$3; n++} END {print "Mean depth:", sum/n}'
Download:
Alignment flags decoded
🏷-R "@RG\tID:...\tSM:...\tPL:...\tLB:..."Read group tag. PSMC does not require it, but bcftools consensus (next step) uses the SM (sample name) field to name the output. Always add read groups to BAMs — many downstream tools silently fail without them.
📊samtools depth -a | awkComputes mean coverage across all sites including zeros (-a flag). PSMC requires ≥15× mean depth. At lower coverage, heterozygous sites are missed and Ne is underestimated. The upper depth filter (next step) is set to 2× mean depth to exclude mapping artefacts in repetitive regions.

Step 2 — Call Consensus Diploid FASTA

Bash — bcftools mpileup → consensus FASTA → PSMC input
mkdir -p 2_consensus 3_psmc
REF="ref/GCF_018350335.1_PunCin_v1.0_genomic.fna"
BAM="1_alignment/snowleopard.sorted.bam"
DEPTH=30   # approximate mean depth; set max to 2x mean

# Step 1: Call variants with bcftools mpileup + call
# This produces a diploid consensus VCF
bcftools mpileup \
    --min-MQ 30 \
    --min-BQ 20 \
    -Ou \
    -f ${REF} \
    ${BAM} | \
  bcftools call \
    -c \
    --output-type z \
    --output 2_consensus/snowleopard.vcf.gz

bcftools index 2_consensus/snowleopard.vcf.gz

# Step 2: Convert VCF to consensus FASTA
# Mask sites with depth < 5 or > 2x mean (repetitive regions)
vcfutils.pl vcf2fq \
    -d 5 \
    -D $((DEPTH * 2)) \
    <(bcftools view 2_consensus/snowleopard.vcf.gz) | \
  gzip > 2_consensus/snowleopard_consensus.fa.gz

# Step 3: Convert consensus FASTA to PSMC input format (.psmcfa)
# fq2psmcfa encodes heterozygous/homozygous state in 100 bp bins
fq2psmcfa -q 20 2_consensus/snowleopard_consensus.fa.gz > 3_psmc/snowleopard.psmcfa

# Check the psmcfa (each line = one 100 bp bin; T/t = het/hom, N = missing)
head -5 3_psmc/snowleopard.psmcfa
wc -l 3_psmc/snowleopard.psmcfa
Download:
Consensus FASTA and psmcfa decoded
📄bcftools call -cConsensus caller mode: for each site, outputs the most likely diploid genotype. This is simpler than the gVCF mode needed for GATK joint calling — PSMC only needs to know if a site is heterozygous, not the exact alleles.
🎯vcfutils.pl vcf2fq -d 5 -D 60Converts VCF to a diploid FASTQ where quality scores encode confidence. -d 5 masks sites with depth <5 (unreliable calls). -D 60 masks ultra-high-depth sites (likely in repetitive regions where reads pile up falsely). The output FASTQ is then fed to fq2psmcfa.
📊fq2psmcfa -q 20Encodes the consensus FASTQ into PSMC’s compact binary format. Each position is encoded as: T (confident heterozygous), t (confident homozygous), N (masked/insufficient data). The -q 20 flag requires Phred Q≥20 to call a site. Windows of 100 bp are encoded as one character.

Step 3 — Run PSMC

Bash — PSMC main run
module load psmc/0.6.5

# Run PSMC
# -N 30: max 30 EM iterations
# -t 15: upper limit on 2*N0*t (coalescent time)
# -r 5:  upper limit on theta/rho ratio
# -p "4+25*2+4+6": time segment pattern (standard for mammalian genomes)
psmc \
    -N 30 \
    -t 15 \
    -r 5 \
    -p "4+25*2+4+6" \
    -o 3_psmc/snowleopard.psmc \
    3_psmc/snowleopard.psmcfa

echo "PSMC complete. Output: 3_psmc/snowleopard.psmc"

# Quick check: view the last iteration's Ne estimates
# (columns: time interval, lambda = relative Ne, pi = theta, rho = recombination rate)
grep "^RS" 3_psmc/snowleopard.psmc | tail -30
Download:
PSMC parameters decoded
📈-p "4+25*2+4+6"The time segment pattern defines how coalescent time is divided into intervals. 4+25*2+4+6 = 4 short intervals, then 25 intervals of length 2, then 4 short, then 6 long. This gives 64 intervals total. More intervals = finer resolution but more parameters. The standard 4+25*2+4+6 pattern is recommended for mammalian genomes.
🌟-t 15Maximum coalescent time in units of 2Nμ generations. Setting this too low truncates ancient history; too high wastes intervals on time periods with no data. 15 is appropriate for most mammals. For very ancient divergences (>5 Mya), increase to 20–25.
📊-r 5Upper bound on the θ/ρ ratio (mutation rate / recombination rate). Set to 5 for mammals. Species with very high recombination (e.g. plants) may need a lower value (2–3). PSMC uses recombination to separate closely-related heterozygous sites.
🕐
Runtime: PSMC on a mammalian genome (~3 Gb) takes 2–6 hours on a single core. Each EM iteration reduces the likelihood. Check the log — if the likelihood decreases after iteration 20+, the model has converged. Unlike BEAST/MrBayes, PSMC is deterministic (no MCMC) so no chain convergence diagnostics are needed.

Step 4 — Bootstrap PSMC

Bash — Generate 100 bootstrap replicates for confidence intervals
mkdir -p 3_psmc/bootstrap

# splitfa splits the psmcfa into ~5 Mb segments for bootstrapping
splitfa 3_psmc/snowleopard.psmcfa > 3_psmc/snowleopard_split.psmcfa

# Generate 100 bootstrap replicates (shuffle and resample segments)
# Run as a loop — each bootstrap takes ~2-6h; use an HPC array in practice
for i in $(seq 1 100); do
    psmc \
        -N 30 \
        -t 15 \
        -r 5 \
        -b \
        -p "4+25*2+4+6" \
        -o 3_psmc/bootstrap/round${i}.psmc \
        3_psmc/snowleopard_split.psmcfa
done

echo "All bootstrap runs complete."

# Combine original + bootstraps for plotting
cat 3_psmc/snowleopard.psmc 3_psmc/bootstrap/*.psmc > 3_psmc/combined.psmc
Download:
Bootstrap strategy decoded
splitfa → bootstrap resamplingPSMC bootstraps by resampling chromosomal segments (not individuals). splitfa cuts the psmcfa into ~5 Mb segments, and the -b flag in PSMC causes it to randomly resample segments with replacement. 100 replicates give stable 95% CI bands on the Ne plot. Running 100 bootstraps is compute-intensive — submit as a SLURM array job.
📊cat combined.psmcConcatenate the original + all bootstrap runs into one file. The psmc_plot.pl script (next step) automatically recognises each run as a separate record separated by // delimiters and draws the bootstraps as shaded ribbons around the main estimate.

Step 5 — Plot PSMC Results

Bash + R — Convert PSMC output to real time and plot
MU=2.8e-9    # mutation rate per site per generation (felid)
G=4          # generation time in years

# psmc_plot.pl: convert coalescent time to real time
# -u: per-generation per-site mutation rate
# -g: generation time in years
# -R: output raw data (no plotting)
psmc_plot.pl \
    -u ${MU} \
    -g ${G} \
    -R \
    3_psmc/psmc_plot \
    3_psmc/combined.psmc

# The above creates psmc_plot.0.txt (main) and psmc_plot.1.txt ... (bootstraps)
# Plot with R for publication quality:
Rscript - <<'EOF'
library(ggplot2)
library(tidyverse)

mu <- 2.8e-9
g  <- 4

# Read main PSMC result (psmc_plot.pl -R output has cols: time(yr), Ne)
main_psmc <- read.table("3_psmc/psmc_plot.0.txt", col.names=c("time","Ne"))

# Read all bootstrap files
boot_files <- list.files("3_psmc", pattern="psmc_plot\\.[0-9]+\\.txt", full.names=TRUE)
boots <- map_dfr(boot_files[-1], function(f) {
    d <- read.table(f, col.names=c("time","Ne"))
    d$rep <- f
    d
})

ggplot() +
    geom_step(data=boots, aes(x=time/1e3, y=Ne/1e3, group=rep),
              color="steelblue", alpha=0.15, linewidth=0.3) +
    geom_step(data=main_psmc, aes(x=time/1e3, y=Ne/1e3),
              color="#0d6efd", linewidth=1.2) +
    scale_x_log10(labels=scales::comma) +
    scale_y_log10(labels=scales::comma) +
    labs(
        x="Time (thousand years ago)",
        y="Effective population size Ne (thousands)",
        title="Snow Leopard — PSMC Demographic History",
        subtitle=paste0("µ=", mu, " per site/gen, g=", g, " years")
    ) +
    annotation_logticks(sides="bl") +
    theme_bw(base_size=13)

ggsave("5_plots/psmc_plot.png", width=10, height=6, dpi=150)
EOF
Download:
PSMC plot and time conversion decoded
🕐-u mutation rate -g generation timePSMC estimates time in coalescent units (2Nμ). Converting to years requires: time_years = t_coalescent * 2 * N0 * g, where N0 is the ancestral Ne. The psmc_plot.pl script handles this conversion automatically. Getting μ and g wrong shifts the entire Ne curve on the time axis — these are the biggest sources of uncertainty in PSMC results.
📊scale_x_log10 + scale_y_log10Log-log scale is standard for PSMC plots. It allows both ancient (~millions of years) and recent (~thousands of years) history to be visible simultaneously. Bootstrap replicates are plotted as transparent lines to form a confidence ribbon around the main estimate.
🌑
Expected snow leopard result: PSMC typically shows a large Ne (~100k) during warm interglacials, a bottleneck during the Last Glacial Maximum (~20 kya dropping to ~10–20k), and possible ancient bottlenecks during earlier glacial cycles (~100–200 kya). This correlates with known Pleistocene climate oscillations in Central Asian highlands.

Step 6 — SMC++ (Multi-Sample Inference)

SMC++ uses multiple individuals simultaneously and produces more accurate estimates for recent history (<100 kya), where PSMC has low resolution.

Bash — SMC++ workflow
mkdir -p 4_smcpp
module load smcpp/1.15.4

# SMC++ requires phased/unphased VCF (multi-sample, bgzipped + tabixed)
# If you have multiple snow leopard individuals:
VCF="multi_sample_snowleopard.vcf.gz"
POP_FILE="population_ids.txt"   # one sample ID per line for the population

# Step 1: Convert VCF to SMC++ input format (per chromosome)
for CHR in $(seq 1 19); do
    smc++ vcf2smc \
        --pop snowleopard:${POP_FILE} \
        --mask repeat_mask.bed.gz \
        ${VCF} \
        4_smcpp/snowleopard.chr${CHR}.smc.gz \
        chr${CHR} \
        snowleopard:$(paste -sd',' ${POP_FILE})
done

# Step 2: Estimate Ne using SMC++
smc++ estimate \
    --cores 8 \
    --em-iterations 20 \
    --output 4_smcpp/ \
    2.8e-9 \
    4_smcpp/snowleopard.chr*.smc.gz

# Step 3: Plot the SMC++ model
smc++ plot \
    --csv \
    4_smcpp/smcpp_result.csv \
    4_smcpp/model.final.json

echo "SMC++ model saved in 4_smcpp/model.final.json"
Download:
SMC++ commands decoded
📄smc++ vcf2smc --pop popname:sample_fileConverts a multi-sample VCF into SMC++-format compressed files, one per chromosome. The --pop flag assigns individuals to a named population. --mask specifies regions to exclude (repeats, low-mappability) — strongly recommended to avoid artefacts in repetitive regions.
📊smc++ estimate 2.8e-9Runs the SMC++ optimiser. The positional argument is the mutation rate (μ). Unlike PSMC, SMC++ uses a spline-based model and penalised likelihood — it can resolve recent Ne changes that PSMC cannot. Outputs a JSON model file and CSV of Ne estimates at each time point.

Step 7 — Convert Time to Calendar Years

Time conversion formula: PSMC outputs time in units of 2Nμ. To convert: T_years = t * 2 * N0 * g, where t = coalescent time, N0 = ancestral Ne (from the flat region of the PSMC curve), g = generation time. SMC++ outputs time directly in generations; multiply by g to get years. Both methods are sensitive to assumed μ and g — always show sensitivity analyses by varying these parameters by ±50%.
📊
PSMC vs SMC++ complementarity: PSMC resolves ancient history well (100 kya – 5 Mya) from a single individual, while SMC++ resolves recent history well (1 kya – 100 kya) from multiple individuals. The ideal workflow is to run both and show them on the same plot — they should overlap and agree in the overlapping time range, validating your results.
ADMIXTURE Population Structure Fst & Selective Sweeps