Convert VCF to TreeMix format · Build ML population tree · Add migration edges · Determine optimal M with optM · Visualise graphs & residuals
TreeMix (Pickrell & Pritchard 2012) models population history as a maximum-likelihood graph — a bifurcating tree augmented with directed migration edges representing gene flow events. Unlike ADMIXTURE (which clusters individuals) or D-statistics (which test specific four-taxon configurations), TreeMix simultaneously infers both the tree structure and the location and magnitude of gene flow across many populations at once.
The Human Genome Diversity Panel (HGDP) is the original dataset used in the TreeMix paper (Pickrell & Pritchard 2012). Running TreeMix on HGDP and recovering the known migration events (Central Asian admixture, Americas–East Asia connection, etc.) is the standard validation for any TreeMix pipeline.
| Property | Value |
|---|---|
| Organism | Homo sapiens |
| Dataset | HGDP (Pickrell & Pritchard 2012 TreeMix dataset) |
| Download | Reich Lab Software page — TreeMix example data |
| Populations | 53 worldwide populations |
| SNPs (LD-pruned) | ~500,000 autosomal SNPs |
| Expected migration events | M=5 typically recovers: Mozabite (Berber + European admixture), Uyghur (East Asian + European), Native Americans (East Asian source) |
######################################################################
## STEP 1: Download the HGDP dataset and prepare LD-pruned PLINK files
## TreeMix requires population-level allele frequencies, not individual
## genotypes — we will aggregate per-population in Step 2.
## TreeMix works best with LD-pruned SNPs to avoid correlated markers
## inflating apparent structure.
######################################################################
mkdir -p 0_raw 1_plink 2_treemix_input 3_runs 4_bootstrap 5_plots
module load plink2/2.0
## --- Option A: Download the exact TreeMix paper dataset (HGDP) ---
## Available from the Reich Lab software page (direct PLINK format)
wget -P 0_raw/ \
https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/\
HGDP_940_plink.tar.gz
tar -xzf 0_raw/HGDP_940_plink.tar.gz -C 0_raw/
## This extracts: HGDP_940.bed, HGDP_940.bim, HGDP_940.fam, HGDP_940.clust
## The .clust file maps each individual to their population — used by TreeMix
## --- Option B: Use the HGDP WGS VCF from ADMIXTURE tutorial (if already downloaded) ---
## (then proceed to the PLINK convert + LD prune steps below)
## --- LD pruning (if starting from existing PLINK files) ---
## -bfile : prefix for PLINK binary files (.bed/.bim/.fam)
## --indep-pairwise 50 10 0.1
## 50 = window size in SNPs (scan 50 SNPs at a time)
## 10 = step size (move 10 SNPs forward each scan)
## 0.1 = r-squared threshold (remove one SNP from any pair with r^2 > 0.1)
## --allow-extra-chr : needed for non-standard chromosome names
plink2 \
--bfile 0_raw/HGDP_940 \
--indep-pairwise 50 10 0.1 \
--allow-extra-chr \
--out 1_plink/ld_prune
## Extract only the LD-pruned SNPs into a new PLINK binary file
## --extract : keep only SNPs listed in the .prune.in file
plink2 \
--bfile 0_raw/HGDP_940 \
--extract 1_plink/ld_prune.prune.in \
--make-bed \
--allow-extra-chr \
--out 1_plink/hgdp_pruned
## Report how many SNPs remain after pruning
echo "SNPs before LD pruning: $(wc -l < 0_raw/HGDP_940.bim)"
echo "SNPs after LD pruning: $(wc -l < 1_plink/hgdp_pruned.bim)"
echo "Samples: $(wc -l < 1_plink/hgdp_pruned.fam)"
## Copy the population cluster file (maps individual → population)
cp 0_raw/HGDP_940.clust 1_plink/hgdp.clust
echo "Population clusters available: $(awk '{print $3}' 1_plink/hgdp.clust | sort -u | wc -l)"
| 📄.clust file format | The PLINK cluster file has three columns: FID (family ID), IID (individual ID), and cluster/population name. TreeMix uses this to group individuals into populations and compute population-level allele frequencies. Every individual must have a population assignment — unassigned individuals are excluded. |
| 🌟Why LD pruning for TreeMix? | TreeMix models the covariance matrix of allele frequencies between populations. Correlated SNPs (in LD) add redundant information that inflates the apparent covariance and can pull migration edges to spurious locations. LD pruning ensures each SNP contributes independent information. Aim for ~200k–500k pruned SNPs for stable TreeMix results. |
######################################################################
## STEP 2: Convert PLINK binary format to TreeMix input format
## TreeMix input format:
## - One row per SNP
## - One column per population
## - Each cell = "count_allele1,count_allele2" (allele counts, not frequencies)
## - First row = population names (space-separated header)
## - File must be bgzipped
## The plink2treemix.py script automates this conversion
######################################################################
## --- Download the conversion script ---
## Part of the TreeMix distribution or available separately
wget -O plink2treemix.py \
https://raw.githubusercontent.com/gavinband/bingwa/master/utils/plink2treemix.py
## --- Run the conversion ---
## This script:
## 1. Reads the PLINK .bed file (binary genotype matrix)
## 2. Groups individuals by population using the .clust file
## 3. For each SNP and each population: counts reference and alternate alleles
## 4. Outputs the count matrix in TreeMix format
## --bfile : PLINK binary file prefix
## --pop : cluster file mapping individuals to populations
## --out : output file prefix
python plink2treemix.py \
--bfile 1_plink/hgdp_pruned \
--pop 1_plink/hgdp.clust \
--out 2_treemix_input/hgdp_treemix
## Output: hgdp_treemix.treemix.gz (bgzipped TreeMix format)
## --- Verify the output format ---
## The first line should be population names (space-separated)
## Subsequent lines: allele counts per population per SNP
zcat 2_treemix_input/hgdp_treemix.treemix.gz | head -3
## Count SNPs and populations
echo "Number of SNPs: $(zcat 2_treemix_input/hgdp_treemix.treemix.gz | tail -n +2 | wc -l)"
echo "Number of pops: $(zcat 2_treemix_input/hgdp_treemix.treemix.gz | head -1 | wc -w)"
| 📊Allele counts not frequencies | TreeMix input uses raw allele counts (e.g. "14,36" means 14 copies of allele 1 and 36 copies of allele 2 in that population) rather than frequencies. This allows TreeMix to properly account for sample size differences between populations when estimating the covariance matrix. Populations with more individuals get more statistical weight. |
| 📄bgzipped output | The output file must be bgzipped (block-compressed) — TreeMix reads it using the bgzf library for efficiency. Regular gzip files will cause TreeMix to crash. If plink2treemix.py does not bgzip automatically, run: bgzip hgdp_treemix.treemix |
######################################################################
## STEP 3: Run TreeMix with 0 to 10 migration edges
## We run multiple values of M (number of migration edges) to:
## 1. Find out how many migrations improve the model fit
## 2. Identify when adding more edges stops helping (diminishing returns)
## We also run multiple replicates per M to check reproducibility
## (TreeMix uses a random starting tree, so results vary between runs)
######################################################################
## TreeMix installation: https://bitbucket.org/nygcresearch/treemix/wiki/Home
## Or via conda: conda install -c bioconda treemix
INPUT="2_treemix_input/hgdp_treemix.treemix.gz"
OUTGROUP="Yoruba" # root the tree using an African population as outgroup
## --- Run 5 replicates for each value of M from 0 to 10 ---
## Running multiple replicates lets us check if the same topology is recovered
## consistently (reproducible = more reliable result)
for M in $(seq 0 10); do
mkdir -p 3_runs/m${M}
for REP in $(seq 1 5); do
echo "Running M=${M}, replicate ${REP}..."
## treemix flags explained:
## -i : input file (bgzipped TreeMix format from Step 2)
## -o : output file prefix (creates multiple output files)
## -root : population to use as tree root/outgroup
## -m : number of migration edges to add after the base tree
## (M=0 = pure tree with no gene flow)
## -k 500 : block size for jackknife standard error estimation
## (blocks of 500 SNPs; accounts for LD in standard error)
## -seed : random seed for reproducibility within replicate
## (different seeds give different starting trees)
## -global : run global rearrangements (slower but more thorough)
## (recommended for >20 populations to avoid local optima)
treemix \
-i ${INPUT} \
-o 3_runs/m${M}/hgdp_m${M}_rep${REP} \
-root ${OUTGROUP} \
-m ${M} \
-k 500 \
-seed $((M * 100 + REP)) \
-global
echo " Output: 3_runs/m${M}/hgdp_m${M}_rep${REP}.*"
done
done
echo "All TreeMix runs complete."
## --- Check log likelihood for each run ---
## Higher log-likelihood = better model fit
## Compare across replicates of the same M (similar = converged)
for M in $(seq 0 10); do
echo -n "M=${M} log-likelihoods: "
for REP in $(seq 1 5); do
tail -1 3_runs/m${M}/hgdp_m${M}_rep${REP}.llik | awk '{printf "%.1f ", $1}'
done
echo ""
done
| 🌳-root Yoruba | Specifies the outgroup population to root the tree. TreeMix builds an unrooted tree then places the root on the branch leading to this population. For human datasets, an African population (Yoruba, Mbuti) is used because modern humans originated in Africa — this places the root correctly on the African branch. |
| 📊-k 500 | Block size (in SNPs) for the block jackknife standard error estimation. Larger blocks = more conservative (wider) confidence intervals but capture more LD structure. 500 SNPs (~250 kb for humans) is the standard for human datasets. For species with less LD (e.g. plants), use smaller blocks (100–200). |
| 🌎-global | After each migration edge is added, TreeMix performs global rearrangements — trying all possible tree topologies, not just local modifications. This is computationally expensive but avoids being trapped in local optima. Essential when analysing >20 populations where many tree topologies are possible. |
| 🏧Multiple replicates | TreeMix starts from a random neighbour-joining tree and optimises from there. Different starting trees can give different final topologies (local optima). Running 5–10 replicates and taking the best log-likelihood (or checking that replicates agree) is essential for confidence in the result. |
--array=0-10 with M=$SLURM_ARRAY_TASK_ID.######################################################################
## STEP 4: Use the optM R package to determine the optimal number of
## migration edges (M).
## optM evaluates three criteria to select the best M:
## 1. Likelihood improvement: does adding another edge improve the fit?
## 2. Explained variance: what fraction of variance is explained?
## 3. Second-order rate of change (Evanno-like): elbow detection
######################################################################
## Install optM: install.packages("optM")
library(optM)
library(ggplot2)
## --- Read all TreeMix output files ---
## optM::optM() reads .llik files (log-likelihoods) from each run
## and computes the statistics to select optimal M
## Specify the directory containing all TreeMix runs
## optM expects a naming convention: *_m[0-10]_rep[1-5].*
runs_dir <- "3_runs/"
## Run optM analysis
## folder : directory containing TreeMix output files
## tsv : file listing all output prefixes (one per line)
## Create this list with:
## find 3_runs/ -name "*.llik" > 3_runs/all_runs.txt
## Generate the file list
system("find 3_runs/ -name '*.llik' | sed 's/.llik//' > 3_runs/all_prefixes.txt")
## Run optM
opt_result <- optM(
folder = runs_dir,
tsv = "3_runs/all_prefixes.txt"
)
## --- View the results table ---
## Columns: m, mean_lnL, stdev_lnL, Lm, Lm', Lm'', delta_m, pct_variance
print(opt_result)
## The optimal M is at the elbow of the Lm'' curve (like Evanno method for STRUCTURE)
## OR at the M where pct_variance levels off (no longer improves substantially)
cat("\nRecommended M:", opt_result$m[which.max(opt_result$Lm2)], "\n")
cat("% variance explained at recommended M:",
opt_result$pct_var[which.max(opt_result$Lm2)], "\n")
## --- Plot the optM diagnostic figure ---
## This shows all three criteria together on one plot
plot_optM(opt_result,
method = "Evanno", # Evanno method = second derivative approach
pdf.file = "5_plots/optM_plot.pdf")
## Also save as PNG for easier viewing
## (re-plot using ggplot2 for customisation)
df <- data.frame(
m = opt_result$m,
mean_lnL = opt_result$mean_lnL,
pct_var = opt_result$pct_var,
delta_m = opt_result$Lm2 # second-order rate of change (Evanno-like)
)
## Panel 1: Log-likelihood improvement per migration edge
p1 <- ggplot(df, aes(x=m, y=mean_lnL)) +
geom_line(color="steelblue", linewidth=1) +
geom_point(color="steelblue", size=3) +
geom_errorbar(aes(ymin=mean_lnL-opt_result$stdev_lnL,
ymax=mean_lnL+opt_result$stdev_lnL), width=0.2) +
labs(x="Migration events (M)", y="Mean log-likelihood",
title="Log-likelihood by number of migration edges") +
theme_bw(base_size=12)
## Panel 2: Percentage variance explained
p2 <- ggplot(df, aes(x=m, y=pct_var)) +
geom_line(color="forestgreen", linewidth=1) +
geom_point(color="forestgreen", size=3) +
labs(x="Migration events (M)", y="% variance explained") +
theme_bw(base_size=12)
## Panel 3: Second-order rate of change (peak = optimal M)
p3 <- ggplot(df[-1,], aes(x=m[-1], y=delta_m[-1])) +
geom_line(color="firebrick", linewidth=1) +
geom_point(color="firebrick", size=3) +
geom_vline(xintercept=df$m[which.max(df$delta_m)],
linetype="dashed", color="black") +
labs(x="Migration events (M)", y="Delta M (2nd derivative)") +
theme_bw(base_size=12)
library(patchwork)
ggsave("5_plots/optM_plot.png", p1 / p2 / p3, width=8, height=12, dpi=150)
| 📈Second derivative (Evanno method) | The second-order rate of change of the log-likelihood identifies the M where adding more edges stops improving the model. This “elbow” is analogous to the Evanno method for choosing K in STRUCTURE. The M at the peak of the Lm” curve is statistically optimal. |
| 🎯% variance explained | TreeMix also reports the fraction of variance in the allele frequency covariance matrix explained by the model. When this plateaus (<1% improvement per additional M), you have enough migration edges. For the HGDP dataset, M=5–6 typically explains >99.9% of variance. |
######################################################################
## STEP 5: Visualise the TreeMix population graph
## TreeMix comes with R plotting scripts (plotting_funcs.R)
## We use these to create the tree + migration arrow plot
## Migration arrows: coloured by weight (0 = weak, 1 = complete admixture)
######################################################################
## Source the TreeMix plotting functions
## These are distributed with TreeMix in the src/ folder
source("plotting_funcs.R") # adjust path to where you installed TreeMix
## Choose the best replicate for the optimal M (highest log-likelihood)
## For this example: M=5, replicate 1
M_BEST <- 5
REP_BEST <- 1
PREFIX <- paste0("3_runs/m", M_BEST, "/hgdp_m", M_BEST, "_rep", REP_BEST)
## --- Plot the tree with migration edges ---
## plot_tree() is from plotting_funcs.R
## it reads: PREFIX.treeout.gz, PREFIX.vertices.gz, PREFIX.edges.gz
## and produces a tree diagram with:
## - Branch lengths proportional to genetic drift
## - Dashed arrows = migration edges
## - Arrow colour = migration weight (0 = cold/blue to 1 = hot/red)
png("5_plots/tree_m5.png", width=1600, height=1200, res=120)
plot_tree(
cex = 0.8, # text size for population labels
PREFIX = PREFIX, # path to TreeMix output files (without extension)
o = "Yoruba", # outgroup population (roots the tree visually)
ybar = 0.1 # extra vertical space for migration arrows
)
title(paste0("HGDP TreeMix: M=", M_BEST, " migration events"))
dev.off()
cat("Tree plot saved: 5_plots/tree_m5.png\n")
## --- Also plot the no-migration (M=0) tree for comparison ---
PREFIX_M0 <- "3_runs/m0/hgdp_m0_rep1"
png("5_plots/tree_m0.png", width=1200, height=1000, res=120)
plot_tree(cex=0.8, PREFIX=PREFIX_M0, o="Yoruba")
title("HGDP TreeMix: M=0 (pure tree, no migration)")
dev.off()
cat("Tree M=0 saved: 5_plots/tree_m0.png\n")
| 📈Branch lengths = genetic drift | In TreeMix, branch lengths represent the amount of genetic drift that occurred along each lineage — longer branch = more drift = smaller population size or longer time. This is different from a molecular clock tree where branch length = time. A long branch to an isolated population (e.g. Papuans) indicates a small effective size or long isolation. |
| 🌈Migration arrow colour | Migration arrow colours encode the weight of the migration event: cold colours (blue/green) = low weight (small fraction of ancestry from that source), warm colours (orange/red) = high weight (large fraction). An arrow from population A to population B with weight 0.15 means: 15% of B’s ancestry came from a lineage close to A. |
######################################################################
## STEP 6: Analyse TreeMix residuals
## Residuals = difference between observed allele frequency covariance
## and what the fitted tree+migration model predicts
## Large positive residuals for a pair of populations mean:
## those two populations share MORE alleles than the model explains
## = evidence of additional unmodelled gene flow between them
## This tells you WHERE to add the next migration edge
######################################################################
## Source plotting functions again (if not already loaded)
source("plotting_funcs.R")
PREFIX <- "3_runs/m5/hgdp_m5_rep1" # optimal M=5 model
OUTGROUP <- "Yoruba"
## --- Plot residuals matrix ---
## plot_resid() produces a heatmap of residuals
## Red = covariance higher than expected (missing gene flow)
## Blue = covariance lower than expected (overfitting/negative drift)
## Pairs to watch: if two populations are red = add a migration edge between them
png("5_plots/residuals_m5.png", width=1400, height=1200, res=120)
plot_resid(
stem = PREFIX, # path prefix for TreeMix output
pop_order = OUTGROUP # reference population for ordering the matrix
)
title("TreeMix residuals: M=5 (red = unexplained shared drift)")
dev.off()
cat("Residuals plot saved: 5_plots/residuals_m5.png\n")
## --- Extract the largest residuals programmatically ---
## The residuals are stored in PREFIX.modelcov.gz (model covariance)
## and PREFIX.cov.gz (observed covariance)
## Compute residuals manually in R for more control:
library(tidyverse)
## Read observed and model covariance matrices
obs_cov <- as.matrix(read.table(gzfile(paste0(PREFIX, ".cov.gz"))))
model_cov <- as.matrix(read.table(gzfile(paste0(PREFIX, ".modelcov.gz"))))
## Compute residuals matrix
resid_mat <- obs_cov - model_cov
## Get population names from vertex file
vertices <- read.table(gzfile(paste0(PREFIX, ".vertices.gz")), header=TRUE)
leaf_pops <- vertices$pop[vertices$leaf == "TRUE"]
rownames(resid_mat) <- colnames(resid_mat) <- leaf_pops
## Convert to long format and find the top 10 pairs with largest residuals
resid_long <- as.data.frame(as.table(resid_mat)) %>%
rename(pop1=Var1, pop2=Var2, residual=Freq) %>%
filter(pop1 != pop2, # exclude diagonal (self-comparisons)
as.character(pop1) < as.character(pop2)) %>% # avoid duplicate pairs
arrange(desc(residual))
cat("Top 10 population pairs with unexplained covariance (candidates for next migration):\n")
print(head(resid_long, 10))
| 📊Residuals = obs − model covariance | TreeMix fits a model to the observed allele frequency covariance between populations. The residual for a population pair is how much their actual covariance exceeds what the model predicts. Positive residual = those populations are more closely related than the tree+migration model accounts for. The population pair with the largest positive residual is the most likely candidate for an additional migration edge. |
| 🎯Using residuals to guide model building | Rather than running M=0 through M=10 blindly, experienced users look at residuals after M=0, identify the two populations with the largest residual, then add a migration edge between them for M=1. Repeat iteratively. This directed approach finds biologically meaningful migration events faster than blind M scanning. |
######################################################################
## STEP 7: Bootstrap TreeMix to determine if migration edges are
## statistically supported.
## Strategy: resample SNP blocks with replacement and re-run TreeMix.
## If a migration edge appears in >50% of bootstrap replicates,
## it is considered well-supported.
######################################################################
INPUT="2_treemix_input/hgdp_treemix.treemix.gz"
M_BEST=5
mkdir -p 4_bootstrap/m${M_BEST}
## --- Generate 100 bootstrap replicates ---
## -bootstrap flag: TreeMix resamples SNP blocks (size = -k) with replacement
## Run 100 replicates (submit as SLURM array for speed)
for i in $(seq 1 100); do
## -bootstrap : enable block bootstrap resampling
## -k 500 : bootstrap block size (same as the main analysis)
## -seed : different seed per replicate for independent resampling
treemix \
-i ${INPUT} \
-o 4_bootstrap/m${M_BEST}/boot_rep${i} \
-root Yoruba \
-m ${M_BEST} \
-k 500 \
-seed ${i} \
-bootstrap \
-global
done
echo "Bootstrap replicates complete."
## --- Compute support for each migration edge in R ---
## Count what fraction of bootstrap replicates contain each migration edge
## Source TreeMix plotting functions
source("plotting_funcs.R")
## get_f() computes the fraction of variance explained by each migration edge
## across all bootstrap replicates
bootstrap_support <- get_f(
stem = paste0("4_bootstrap/m", M_BEST, "/boot_rep"),
nrep = 100, # number of bootstrap replicates
fprefix = paste0("3_runs/m", M_BEST, "/hgdp_m", M_BEST, "_rep1"),
bootstrap = TRUE
)
## View results
## Each row = one migration event in the best tree
## f_boot = fraction of bootstrap replicates where this event was inferred
print(bootstrap_support)
cat("Migration edges with > 50% bootstrap support:\n")
print(bootstrap_support[bootstrap_support$f_boot > 0.5, ])
| 📈-bootstrap flag | Enables block bootstrap resampling: SNP blocks of size -k are sampled with replacement to create a new dataset of the same total size. Variation across 100 bootstrap runs reflects uncertainty in the migration edge placement due to finite SNP sampling. This is analogous to bootstrap support values on phylogenetic trees. |
| 🎯>50% bootstrap support threshold | Migration edges appearing in >50% of bootstrap replicates are considered moderately supported; >80% is strong support. Edges with <50% bootstrap support should be treated with caution — they may reflect statistical noise rather than real gene flow events. For publication, report bootstrap support for all migration edges. |