Population Structure with ADMIXTURE

Overview

ADMIXTURE (Alexander et al. 2009) uses a maximum-likelihood approach to decompose each individual’s genotype into proportional contributions from K hypothetical ancestral populations. It is the most widely-used method for population structure inference and produces the familiar “STRUCTURE bar plots” seen in population genetics papers.

Key concepts:

K: number of ancestral populations assumed. Choosing the right K is a statistical and biological question.
Q matrix: per-individual ancestry proportions summing to 1.0. The output you plot as bars.
P matrix: per-population allele frequencies at each SNP. Used for downstream analyses.
Cross-validation error (CV): the metric to select K — choose K at the elbow or minimum.
LD pruning is essential: correlated SNPs violate ADMIXTURE’s independence assumption and inflate apparent structure.

Ideal Directory Structure

admixture_hgdp/

admixture_hgdp/ ├── 0_raw/ # HGDP VCF download ├── 1_plink/ # converted + LD-pruned BED files │ ├── hgdp.bed / .bim / .fam │ └── hgdp_pruned.bed / .bim / .fam ├── 2_admixture/ # ADMIXTURE run output per K │ ├── hgdp_pruned.2.Q # Q matrix for K=2 │ ├── hgdp_pruned.2.P │ ├── hgdp_pruned.7.Q # Q matrix for K=7 (best K) │ └── cv_errors.txt # cross-validation scores └── 3_plots/ # R-generated figures ├── cv_error_plot.png └── structure_plot_K7.png

Dataset

The Human Genome Diversity Panel (HGDP) is a curated collection of 929 individuals from 54 worldwide populations genotyped at ~6 million SNPs. It is the benchmark dataset for population structure methods and covers all major continental groups.

Property	Value
Organism	Homo sapiens
Dataset	HGDP — Bergström et al. 2020, Science
Accession	PRJEB31736 / 1000 Genomes HGDP data portal
Individuals	929 (54 populations, 7 continental regions)
SNPs	~6.4 million high-quality SNPs
Reference	GRCh38

Why HGDP? It is the original dataset used to validate ADMIXTURE (Alexander et al. 2009) and STRUCTURE. The well-known K=7 result (Africa, Europe, Middle East, Central/South Asia, East Asia, Oceania, Americas) makes it ideal for validating your pipeline output.

Step 1 — Download & Filter VCF

Bash — Download HGDP VCF

mkdir -p 0_raw
cd 0_raw

# Download HGDP high-coverage WGS VCF from 1000 Genomes data portal
# Individual chromosome files (~2-8 GB each); start with chr22 for testing
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz.tbi

# Alternative: use the pre-filtered PLINK format release (smaller, recommended)
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz

# Pre-filter: biallelic SNPs only, MAF > 1%, max 5% missing
module load bcftools/1.17
bcftools view \
    --type snps \
    --min-alleles 2 --max-alleles 2 \
    --min-af 0.01:minor \
    hgdp_wgs.20190516.full.chr22.vcf.gz | \
  bcftools view \
    --include 'F_MISSING < 0.05' \
    --output-type z \
    --output hgdp_chr22_filtered.vcf.gz
bcftools index --tbi hgdp_chr22_filtered.vcf.gz

cd ..

Download:

Download and pre-filter decoded

🔨bcftools view --type snps --min-alleles 2 --max-alleles 2	Keep only biallelic SNPs. Removes indels, multiallelic sites, and monomorphic sites. ADMIXTURE strictly requires biallelic SNPs — any other variant type causes errors.
📊--min-af 0.01:minor	Remove sites where the minor allele frequency is below 1%. Extremely rare variants carry little population structure information and increase noise. For ADMIXTURE, 1–5% MAF cutoff is standard.
👥--include 'F_MISSING < 0.05'	bcftools expression to keep sites with <5% missing genotypes. `F_MISSING` is a built-in INFO field computed on the fly from genotype data. More flexible than VCFtools `--max-missing` syntax.

Step 2 — PLINK: Convert & LD Prune

Bash — Convert to PLINK BED and LD prune

mkdir -p 1_plink
module load plink2/2.0

# Merge all chromosomes into one VCF first (if using full genome)
# bcftools concat 0_raw/hgdp_chr*_filtered.vcf.gz -Oz -o 0_raw/hgdp_all_filtered.vcf.gz
# For chr22 test:
VCF="0_raw/hgdp_chr22_filtered.vcf.gz"

# Convert VCF -> PLINK BED
plink2 \
    --vcf ${VCF} \
    --make-bed \
    --out 1_plink/hgdp \
    --allow-extra-chr \
    --set-all-var-ids '@_#' \
    --snps-only just-acgt

echo "Total SNPs: $(wc -l < 1_plink/hgdp.bim)"
echo "Samples:    $(wc -l < 1_plink/hgdp.fam)"

# LD pruning: window 200 SNPs, step 25, r^2 threshold 0.1
# Strict (r^2 < 0.1) is required for ADMIXTURE
plink2 \
    --bfile 1_plink/hgdp \
    --indep-pairwise 200 25 0.1 \
    --allow-extra-chr \
    --out 1_plink/ld_prune

# Extract pruned SNPs
plink2 \
    --bfile 1_plink/hgdp \
    --extract 1_plink/ld_prune.prune.in \
    --make-bed \
    --allow-extra-chr \
    --out 1_plink/hgdp_pruned

echo "SNPs after LD pruning: $(wc -l < 1_plink/hgdp_pruned.bim)"

# ADMIXTURE requires no chromosome code "0" — recode if needed
awk '{$1=0; print $0}' 1_plink/hgdp_pruned.bim > 1_plink/hgdp_pruned_recode.bim
cp 1_plink/hgdp_pruned.bed 1_plink/hgdp_pruned_recode.bed
cp 1_plink/hgdp_pruned.fam 1_plink/hgdp_pruned_recode.fam

Download:

PLINK LD pruning decoded

🌟--indep-pairwise 200 25 0.1	Window of 200 SNPs, shift 25 SNPs, remove one SNP per pair with r² > 0.1. The larger window (200 vs. 50) and smaller r² cutoff (0.1) gives more aggressive pruning. For a globally diverse human dataset expect ~100k–200k SNPs to remain from millions — this is sufficient for ADMIXTURE.
🔨Recode chromosome to 0	ADMIXTURE 1.3.x cannot handle non-numeric chromosome codes. Setting column 1 of the .bim file to `0` (unknown chromosome) is the standard workaround for datasets with alphanumeric chromosome names.

Step 3 — Run ADMIXTURE (K=2–10)

Bash — ADMIXTURE loop with cross-validation (SLURM array)

#!/bin/bash
#SBATCH --array=2-10
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --job-name=admixture_K%a

mkdir -p 2_admixture
cd 2_admixture

K=${SLURM_ARRAY_TASK_ID}
BED="../1_plink/hgdp_pruned_recode.bed"

# Run ADMIXTURE with cross-validation (5-fold CV)
# -j8 uses 8 threads; --cv=5 performs 5-fold cross-validation
admixture --cv=5 -j8 ${BED} ${K} | tee admixture_K${K}.log

echo "Finished K=${K}"

# After all jobs complete, collect CV errors:
# grep "CV error" 2_admixture/admixture_K*.log | \
#   awk '{print $3, $4}' | sed 's/K=//;s/)://' | sort -n > 2_admixture/cv_errors.txt

Download:

ADMIXTURE flags decoded

🌟--cv=5	5-fold cross-validation. ADMIXTURE holds out 20% of genotypes per fold, trains on the remaining 80%, and computes prediction error on the held-out data. Lower CV error = better fit. The K with minimum CV error is statistically optimal.
👥-j8	Number of CPU threads. ADMIXTURE parallelises the E-step of the EM algorithm across threads. Scales well to ~16 cores; beyond that returns diminish. Set to the number of cores you’ve requested in SLURM.
📊Output: .Q and .P files	The `.Q` file has one row per individual and K columns of ancestry proportions (sum to 1.0). The `.P` file has one row per SNP and K columns of allele frequencies in each ancestral population. You typically only need `.Q` for plotting.

🕐

Runtime note: For 929 individuals and ~150k SNPs, each K takes ~30–60 minutes on 8 cores. Run K=2–10 as a SLURM array so all jobs run in parallel. For non-HPC environments, run sequentially in a loop: for K in $(seq 2 10); do admixture --cv=5 -j4 hgdp_pruned.bed $K; done

Step 4 — Cross-Validation & Select K

Bash + R — Collect CV errors and plot

## Bash: collect CV errors from log files
grep "CV error" 2_admixture/admixture_K*.log | \
    awk '{print $3, $4}' | \
    sed 's/K=//;s/)://' | \
    sort -n > 2_admixture/cv_errors.txt

cat 2_admixture/cv_errors.txt
# Expected output:
# 2 0.4821
# 3 0.4512
# 7 0.4201   <- lowest = best K
# 10 0.4350

Download:

CV error selection decoded

🎯Minimum CV error = best statistical K	The K at the minimum cross-validation error is statistically optimal. However, biological context matters — for the HGDP dataset, K=7 matches known continental groups and is biologically interpretable, even if K=5 sometimes has slightly lower CV error depending on the SNP panel used.
📊Elbow vs. minimum	Sometimes the CV error curve has an “elbow” (inflection point) before a flat region. Both the elbow K and the minimum K are defensible choices. Report both in your methods section and show multiple K values in supplementary figures.

Step 5 — Structure Bar Plot in R

R — Publication-quality ADMIXTURE structure plot

library(tidyverse)
library(ggplot2)
library(RColorBrewer)

K_BEST <- 7

# Load Q matrix
Q <- read.table(
    paste0("2_admixture/hgdp_pruned_recode.", K_BEST, ".Q"),
    header = FALSE
)
colnames(Q) <- paste0("K", 1:K_BEST)

# Load FAM file for sample IDs
fam <- read.table("1_plink/hgdp_pruned_recode.fam", header=FALSE)
Q$sample_id <- fam$V2

# Load metadata (region, population)
# Download from: https://www.internationalgenome.org/data-portal/data-collection/hgdp
meta <- read.table("metadata_hgdp.txt", header=TRUE, sep="\t") %>%
    rename(sample_id = Sample)

Q_meta <- left_join(Q, meta, by="sample_id") %>%
    arrange(Region, Population, K1)

# Convert to long format for ggplot
Q_long <- Q_meta %>%
    pivot_longer(cols=starts_with("K"), names_to="component", values_to="proportion")

# Order individuals by region/population
Q_long$sample_id <- factor(Q_long$sample_id, levels=unique(Q_meta$sample_id))

# Structure bar plot
ggplot(Q_long, aes(x=sample_id, y=proportion, fill=component)) +
    geom_bar(stat="identity", width=1, position="stack") +
    facet_grid(~ Region, scales="free_x", space="free_x") +
    scale_fill_manual(values=brewer.pal(K_BEST, "Set1")) +
    labs(
        title = paste0("HGDP ADMIXTURE K=", K_BEST),
        x = "Individual", y = "Ancestry proportion"
    ) +
    theme_bw(base_size=11) +
    theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.spacing = unit(0.5, "mm"),
        strip.text = element_text(size=8, face="bold"),
        legend.position = "bottom"
    ) +
    guides(fill=guide_legend(title="Ancestry", nrow=1))

ggsave("3_plots/structure_plot_K7.png", width=16, height=4, dpi=180)

# CV error plot
cv <- read.table("2_admixture/cv_errors.txt", col.names=c("K","CV"))
ggplot(cv, aes(x=K, y=CV)) +
    geom_line(color="#0d6efd") + geom_point(color="#0d6efd", size=3) +
    geom_point(data=cv[which.min(cv$CV),], color="red", size=5, shape=17) +
    scale_x_continuous(breaks=cv$K) +
    labs(x="K (number of populations)", y="Cross-validation error",
         title="ADMIXTURE CV error by K") +
    theme_bw(base_size=13)
ggsave("3_plots/cv_error_plot.png", width=7, height=5, dpi=150)

Download:

Structure plot decoded

📈pivot_longer + geom_bar(stat="identity")	Convert the wide Q matrix (one column per K) to long format, then stack bars. Each individual is one vertical bar; colours show ancestry proportions. The key is ordering individuals by population and region so the plot shows clean transitions between groups.
👥facet_grid(~ Region, scales="free_x", space="free_x")	Creates separate panels per continent, with panel widths proportional to the number of individuals in each region. This prevents large regions (Africa) from visually dominating the plot.

🌠

Expected HGDP K=7 result: Africa (orange), Europe (blue), Middle East (green), Central/South Asia (yellow), East Asia (red), Oceania (purple), Americas (pink). Each continental group should appear as a near-solid colour, with admixed individuals (e.g. Central Asian populations) showing multi-colour bars. This is the “canonical” ADMIXTURE result first shown by Alexander et al. 2009.

Population Structure with ADMIXTURE

On this page

Overview

Ideal Directory Structure

Dataset

Step 1 — Download & Filter VCF

Step 2 — PLINK: Convert & LD Prune

Step 3 — Run ADMIXTURE (K=2–10)

Step 4 — Cross-Validation & Select K

Step 5 — Structure Bar Plot in R