LD pruning & QC · Cross-validation to select K · Ancestry proportion inference · Publication-quality structure bar plots
ADMIXTURE (Alexander et al. 2009) uses a maximum-likelihood approach to decompose each individual’s genotype into proportional contributions from K hypothetical ancestral populations. It is the most widely-used method for population structure inference and produces the familiar “STRUCTURE bar plots” seen in population genetics papers.
The Human Genome Diversity Panel (HGDP) is a curated collection of 929 individuals from 54 worldwide populations genotyped at ~6 million SNPs. It is the benchmark dataset for population structure methods and covers all major continental groups.
| Property | Value |
|---|---|
| Organism | Homo sapiens |
| Dataset | HGDP — Bergström et al. 2020, Science |
| Accession | PRJEB31736 / 1000 Genomes HGDP data portal |
| Individuals | 929 (54 populations, 7 continental regions) |
| SNPs | ~6.4 million high-quality SNPs |
| Reference | GRCh38 |
mkdir -p 0_raw
cd 0_raw
# Download HGDP high-coverage WGS VCF from 1000 Genomes data portal
# Individual chromosome files (~2-8 GB each); start with chr22 for testing
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz.tbi
# Alternative: use the pre-filtered PLINK format release (smaller, recommended)
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz
# Pre-filter: biallelic SNPs only, MAF > 1%, max 5% missing
module load bcftools/1.17
bcftools view \
--type snps \
--min-alleles 2 --max-alleles 2 \
--min-af 0.01:minor \
hgdp_wgs.20190516.full.chr22.vcf.gz | \
bcftools view \
--include 'F_MISSING < 0.05' \
--output-type z \
--output hgdp_chr22_filtered.vcf.gz
bcftools index --tbi hgdp_chr22_filtered.vcf.gz
cd ..
| 🔨bcftools view --type snps --min-alleles 2 --max-alleles 2 | Keep only biallelic SNPs. Removes indels, multiallelic sites, and monomorphic sites. ADMIXTURE strictly requires biallelic SNPs — any other variant type causes errors. |
| 📊--min-af 0.01:minor | Remove sites where the minor allele frequency is below 1%. Extremely rare variants carry little population structure information and increase noise. For ADMIXTURE, 1–5% MAF cutoff is standard. |
| 👥--include 'F_MISSING < 0.05' | bcftools expression to keep sites with <5% missing genotypes. F_MISSING is a built-in INFO field computed on the fly from genotype data. More flexible than VCFtools --max-missing syntax. |
mkdir -p 1_plink
module load plink2/2.0
# Merge all chromosomes into one VCF first (if using full genome)
# bcftools concat 0_raw/hgdp_chr*_filtered.vcf.gz -Oz -o 0_raw/hgdp_all_filtered.vcf.gz
# For chr22 test:
VCF="0_raw/hgdp_chr22_filtered.vcf.gz"
# Convert VCF -> PLINK BED
plink2 \
--vcf ${VCF} \
--make-bed \
--out 1_plink/hgdp \
--allow-extra-chr \
--set-all-var-ids '@_#' \
--snps-only just-acgt
echo "Total SNPs: $(wc -l < 1_plink/hgdp.bim)"
echo "Samples: $(wc -l < 1_plink/hgdp.fam)"
# LD pruning: window 200 SNPs, step 25, r^2 threshold 0.1
# Strict (r^2 < 0.1) is required for ADMIXTURE
plink2 \
--bfile 1_plink/hgdp \
--indep-pairwise 200 25 0.1 \
--allow-extra-chr \
--out 1_plink/ld_prune
# Extract pruned SNPs
plink2 \
--bfile 1_plink/hgdp \
--extract 1_plink/ld_prune.prune.in \
--make-bed \
--allow-extra-chr \
--out 1_plink/hgdp_pruned
echo "SNPs after LD pruning: $(wc -l < 1_plink/hgdp_pruned.bim)"
# ADMIXTURE requires no chromosome code "0" — recode if needed
awk '{$1=0; print $0}' 1_plink/hgdp_pruned.bim > 1_plink/hgdp_pruned_recode.bim
cp 1_plink/hgdp_pruned.bed 1_plink/hgdp_pruned_recode.bed
cp 1_plink/hgdp_pruned.fam 1_plink/hgdp_pruned_recode.fam
| 🌟--indep-pairwise 200 25 0.1 | Window of 200 SNPs, shift 25 SNPs, remove one SNP per pair with r² > 0.1. The larger window (200 vs. 50) and smaller r² cutoff (0.1) gives more aggressive pruning. For a globally diverse human dataset expect ~100k–200k SNPs to remain from millions — this is sufficient for ADMIXTURE. |
| 🔨Recode chromosome to 0 | ADMIXTURE 1.3.x cannot handle non-numeric chromosome codes. Setting column 1 of the .bim file to 0 (unknown chromosome) is the standard workaround for datasets with alphanumeric chromosome names. |
#!/bin/bash
#SBATCH --array=2-10
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --job-name=admixture_K%a
mkdir -p 2_admixture
cd 2_admixture
K=${SLURM_ARRAY_TASK_ID}
BED="../1_plink/hgdp_pruned_recode.bed"
# Run ADMIXTURE with cross-validation (5-fold CV)
# -j8 uses 8 threads; --cv=5 performs 5-fold cross-validation
admixture --cv=5 -j8 ${BED} ${K} | tee admixture_K${K}.log
echo "Finished K=${K}"
# After all jobs complete, collect CV errors:
# grep "CV error" 2_admixture/admixture_K*.log | \
# awk '{print $3, $4}' | sed 's/K=//;s/)://' | sort -n > 2_admixture/cv_errors.txt
| 🌟--cv=5 | 5-fold cross-validation. ADMIXTURE holds out 20% of genotypes per fold, trains on the remaining 80%, and computes prediction error on the held-out data. Lower CV error = better fit. The K with minimum CV error is statistically optimal. |
| 👥-j8 | Number of CPU threads. ADMIXTURE parallelises the E-step of the EM algorithm across threads. Scales well to ~16 cores; beyond that returns diminish. Set to the number of cores you’ve requested in SLURM. |
| 📊Output: .Q and .P files | The .Q file has one row per individual and K columns of ancestry proportions (sum to 1.0). The .P file has one row per SNP and K columns of allele frequencies in each ancestral population. You typically only need .Q for plotting. |
for K in $(seq 2 10); do admixture --cv=5 -j4 hgdp_pruned.bed $K; done## Bash: collect CV errors from log files
grep "CV error" 2_admixture/admixture_K*.log | \
awk '{print $3, $4}' | \
sed 's/K=//;s/)://' | \
sort -n > 2_admixture/cv_errors.txt
cat 2_admixture/cv_errors.txt
# Expected output:
# 2 0.4821
# 3 0.4512
# 7 0.4201 <- lowest = best K
# 10 0.4350
| 🎯Minimum CV error = best statistical K | The K at the minimum cross-validation error is statistically optimal. However, biological context matters — for the HGDP dataset, K=7 matches known continental groups and is biologically interpretable, even if K=5 sometimes has slightly lower CV error depending on the SNP panel used. |
| 📊Elbow vs. minimum | Sometimes the CV error curve has an “elbow” (inflection point) before a flat region. Both the elbow K and the minimum K are defensible choices. Report both in your methods section and show multiple K values in supplementary figures. |
library(tidyverse)
library(ggplot2)
library(RColorBrewer)
K_BEST <- 7
# Load Q matrix
Q <- read.table(
paste0("2_admixture/hgdp_pruned_recode.", K_BEST, ".Q"),
header = FALSE
)
colnames(Q) <- paste0("K", 1:K_BEST)
# Load FAM file for sample IDs
fam <- read.table("1_plink/hgdp_pruned_recode.fam", header=FALSE)
Q$sample_id <- fam$V2
# Load metadata (region, population)
# Download from: https://www.internationalgenome.org/data-portal/data-collection/hgdp
meta <- read.table("metadata_hgdp.txt", header=TRUE, sep="\t") %>%
rename(sample_id = Sample)
Q_meta <- left_join(Q, meta, by="sample_id") %>%
arrange(Region, Population, K1)
# Convert to long format for ggplot
Q_long <- Q_meta %>%
pivot_longer(cols=starts_with("K"), names_to="component", values_to="proportion")
# Order individuals by region/population
Q_long$sample_id <- factor(Q_long$sample_id, levels=unique(Q_meta$sample_id))
# Structure bar plot
ggplot(Q_long, aes(x=sample_id, y=proportion, fill=component)) +
geom_bar(stat="identity", width=1, position="stack") +
facet_grid(~ Region, scales="free_x", space="free_x") +
scale_fill_manual(values=brewer.pal(K_BEST, "Set1")) +
labs(
title = paste0("HGDP ADMIXTURE K=", K_BEST),
x = "Individual", y = "Ancestry proportion"
) +
theme_bw(base_size=11) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.spacing = unit(0.5, "mm"),
strip.text = element_text(size=8, face="bold"),
legend.position = "bottom"
) +
guides(fill=guide_legend(title="Ancestry", nrow=1))
ggsave("3_plots/structure_plot_K7.png", width=16, height=4, dpi=180)
# CV error plot
cv <- read.table("2_admixture/cv_errors.txt", col.names=c("K","CV"))
ggplot(cv, aes(x=K, y=CV)) +
geom_line(color="#0d6efd") + geom_point(color="#0d6efd", size=3) +
geom_point(data=cv[which.min(cv$CV),], color="red", size=5, shape=17) +
scale_x_continuous(breaks=cv$K) +
labs(x="K (number of populations)", y="Cross-validation error",
title="ADMIXTURE CV error by K") +
theme_bw(base_size=13)
ggsave("3_plots/cv_error_plot.png", width=7, height=5, dpi=150)
| 📈pivot_longer + geom_bar(stat="identity") | Convert the wide Q matrix (one column per K) to long format, then stack bars. Each individual is one vertical bar; colours show ancestry proportions. The key is ordering individuals by population and region so the plot shows clean transitions between groups. |
| 👥facet_grid(~ Region, scales="free_x", space="free_x") | Creates separate panels per continent, with panel widths proportional to the number of individuals in each region. This prevents large regions (Africa) from visually dominating the plot. |