Population Structure with ADMIXTURE

LD pruning & QC · Cross-validation to select K · Ancestry proportion inference · Publication-quality structure bar plots

~90 min Intermediate Bash / R
Byte

Overview

ADMIXTURE (Alexander et al. 2009) uses a maximum-likelihood approach to decompose each individual’s genotype into proportional contributions from K hypothetical ancestral populations. It is the most widely-used method for population structure inference and produces the familiar “STRUCTURE bar plots” seen in population genetics papers.

Byte
Key concepts:
  • K: number of ancestral populations assumed. Choosing the right K is a statistical and biological question.
  • Q matrix: per-individual ancestry proportions summing to 1.0. The output you plot as bars.
  • P matrix: per-population allele frequencies at each SNP. Used for downstream analyses.
  • Cross-validation error (CV): the metric to select K — choose K at the elbow or minimum.
  • LD pruning is essential: correlated SNPs violate ADMIXTURE’s independence assumption and inflate apparent structure.

Ideal Directory Structure

admixture_hgdp/
admixture_hgdp/ ├── 0_raw/ # HGDP VCF download ├── 1_plink/ # converted + LD-pruned BED files │ ├── hgdp.bed / .bim / .fam │ └── hgdp_pruned.bed / .bim / .fam ├── 2_admixture/ # ADMIXTURE run output per K │ ├── hgdp_pruned.2.Q # Q matrix for K=2 │ ├── hgdp_pruned.2.P │ ├── hgdp_pruned.7.Q # Q matrix for K=7 (best K) │ └── cv_errors.txt # cross-validation scores └── 3_plots/ # R-generated figures ├── cv_error_plot.png └── structure_plot_K7.png

Dataset

The Human Genome Diversity Panel (HGDP) is a curated collection of 929 individuals from 54 worldwide populations genotyped at ~6 million SNPs. It is the benchmark dataset for population structure methods and covers all major continental groups.

PropertyValue
OrganismHomo sapiens
DatasetHGDP — Bergström et al. 2020, Science
AccessionPRJEB31736 / 1000 Genomes HGDP data portal
Individuals929 (54 populations, 7 continental regions)
SNPs~6.4 million high-quality SNPs
ReferenceGRCh38
Why HGDP? It is the original dataset used to validate ADMIXTURE (Alexander et al. 2009) and STRUCTURE. The well-known K=7 result (Africa, Europe, Middle East, Central/South Asia, East Asia, Oceania, Americas) makes it ideal for validating your pipeline output.

Step 1 — Download & Filter VCF

Bash — Download HGDP VCF
mkdir -p 0_raw
cd 0_raw

# Download HGDP high-coverage WGS VCF from 1000 Genomes data portal
# Individual chromosome files (~2-8 GB each); start with chr22 for testing
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz.tbi

# Alternative: use the pre-filtered PLINK format release (smaller, recommended)
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/release/20230531/\
hgdp_wgs.20190516.full.chr22.vcf.gz

# Pre-filter: biallelic SNPs only, MAF > 1%, max 5% missing
module load bcftools/1.17
bcftools view \
    --type snps \
    --min-alleles 2 --max-alleles 2 \
    --min-af 0.01:minor \
    hgdp_wgs.20190516.full.chr22.vcf.gz | \
  bcftools view \
    --include 'F_MISSING < 0.05' \
    --output-type z \
    --output hgdp_chr22_filtered.vcf.gz
bcftools index --tbi hgdp_chr22_filtered.vcf.gz

cd ..
Download:
Download and pre-filter decoded
🔨bcftools view --type snps --min-alleles 2 --max-alleles 2Keep only biallelic SNPs. Removes indels, multiallelic sites, and monomorphic sites. ADMIXTURE strictly requires biallelic SNPs — any other variant type causes errors.
📊--min-af 0.01:minorRemove sites where the minor allele frequency is below 1%. Extremely rare variants carry little population structure information and increase noise. For ADMIXTURE, 1–5% MAF cutoff is standard.
👥--include 'F_MISSING < 0.05'bcftools expression to keep sites with <5% missing genotypes. F_MISSING is a built-in INFO field computed on the fly from genotype data. More flexible than VCFtools --max-missing syntax.

Step 2 — PLINK: Convert & LD Prune

Bash — Convert to PLINK BED and LD prune
Download:
PLINK LD pruning decoded
🌟--indep-pairwise 200 25 0.1Window of 200 SNPs, shift 25 SNPs, remove one SNP per pair with r² > 0.1. The larger window (200 vs. 50) and smaller r² cutoff (0.1) gives more aggressive pruning. For a globally diverse human dataset expect ~100k–200k SNPs to remain from millions — this is sufficient for ADMIXTURE.
🔨Recode chromosome to 0ADMIXTURE 1.3.x cannot handle non-numeric chromosome codes. Setting column 1 of the .bim file to 0 (unknown chromosome) is the standard workaround for datasets with alphanumeric chromosome names.

Step 3 — Run ADMIXTURE (K=2–10)

Bash — ADMIXTURE loop with cross-validation (SLURM array)
#!/bin/bash
#SBATCH --array=2-10
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --job-name=admixture_K%a

mkdir -p 2_admixture
cd 2_admixture

K=${SLURM_ARRAY_TASK_ID}
BED="../1_plink/hgdp_pruned_recode.bed"

# Run ADMIXTURE with cross-validation (5-fold CV)
# -j8 uses 8 threads; --cv=5 performs 5-fold cross-validation
admixture --cv=5 -j8 ${BED} ${K} | tee admixture_K${K}.log

echo "Finished K=${K}"

# After all jobs complete, collect CV errors:
# grep "CV error" 2_admixture/admixture_K*.log | \
#   awk '{print $3, $4}' | sed 's/K=//;s/)://' | sort -n > 2_admixture/cv_errors.txt
Download:
ADMIXTURE flags decoded
🌟--cv=55-fold cross-validation. ADMIXTURE holds out 20% of genotypes per fold, trains on the remaining 80%, and computes prediction error on the held-out data. Lower CV error = better fit. The K with minimum CV error is statistically optimal.
👥-j8Number of CPU threads. ADMIXTURE parallelises the E-step of the EM algorithm across threads. Scales well to ~16 cores; beyond that returns diminish. Set to the number of cores you’ve requested in SLURM.
📊Output: .Q and .P filesThe .Q file has one row per individual and K columns of ancestry proportions (sum to 1.0). The .P file has one row per SNP and K columns of allele frequencies in each ancestral population. You typically only need .Q for plotting.
🕐
Runtime note: For 929 individuals and ~150k SNPs, each K takes ~30–60 minutes on 8 cores. Run K=2–10 as a SLURM array so all jobs run in parallel. For non-HPC environments, run sequentially in a loop: for K in $(seq 2 10); do admixture --cv=5 -j4 hgdp_pruned.bed $K; done

Step 4 — Cross-Validation & Select K

Bash + R — Collect CV errors and plot
## Bash: collect CV errors from log files
grep "CV error" 2_admixture/admixture_K*.log | \
    awk '{print $3, $4}' | \
    sed 's/K=//;s/)://' | \
    sort -n > 2_admixture/cv_errors.txt

cat 2_admixture/cv_errors.txt
# Expected output:
# 2 0.4821
# 3 0.4512
# 7 0.4201   <- lowest = best K
# 10 0.4350
Download:
CV error selection decoded
🎯Minimum CV error = best statistical KThe K at the minimum cross-validation error is statistically optimal. However, biological context matters — for the HGDP dataset, K=7 matches known continental groups and is biologically interpretable, even if K=5 sometimes has slightly lower CV error depending on the SNP panel used.
📊Elbow vs. minimumSometimes the CV error curve has an “elbow” (inflection point) before a flat region. Both the elbow K and the minimum K are defensible choices. Report both in your methods section and show multiple K values in supplementary figures.

Step 5 — Structure Bar Plot in R

R — Publication-quality ADMIXTURE structure plot
library(tidyverse)
library(ggplot2)
library(RColorBrewer)

K_BEST <- 7

# Load Q matrix
Q <- read.table(
    paste0("2_admixture/hgdp_pruned_recode.", K_BEST, ".Q"),
    header = FALSE
)
colnames(Q) <- paste0("K", 1:K_BEST)

# Load FAM file for sample IDs
fam <- read.table("1_plink/hgdp_pruned_recode.fam", header=FALSE)
Q$sample_id <- fam$V2

# Load metadata (region, population)
# Download from: https://www.internationalgenome.org/data-portal/data-collection/hgdp
meta <- read.table("metadata_hgdp.txt", header=TRUE, sep="\t") %>%
    rename(sample_id = Sample)

Q_meta <- left_join(Q, meta, by="sample_id") %>%
    arrange(Region, Population, K1)

# Convert to long format for ggplot
Q_long <- Q_meta %>%
    pivot_longer(cols=starts_with("K"), names_to="component", values_to="proportion")

# Order individuals by region/population
Q_long$sample_id <- factor(Q_long$sample_id, levels=unique(Q_meta$sample_id))

# Structure bar plot
ggplot(Q_long, aes(x=sample_id, y=proportion, fill=component)) +
    geom_bar(stat="identity", width=1, position="stack") +
    facet_grid(~ Region, scales="free_x", space="free_x") +
    scale_fill_manual(values=brewer.pal(K_BEST, "Set1")) +
    labs(
        title = paste0("HGDP ADMIXTURE K=", K_BEST),
        x = "Individual", y = "Ancestry proportion"
    ) +
    theme_bw(base_size=11) +
    theme(
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.spacing = unit(0.5, "mm"),
        strip.text = element_text(size=8, face="bold"),
        legend.position = "bottom"
    ) +
    guides(fill=guide_legend(title="Ancestry", nrow=1))

ggsave("3_plots/structure_plot_K7.png", width=16, height=4, dpi=180)

# CV error plot
cv <- read.table("2_admixture/cv_errors.txt", col.names=c("K","CV"))
ggplot(cv, aes(x=K, y=CV)) +
    geom_line(color="#0d6efd") + geom_point(color="#0d6efd", size=3) +
    geom_point(data=cv[which.min(cv$CV),], color="red", size=5, shape=17) +
    scale_x_continuous(breaks=cv$K) +
    labs(x="K (number of populations)", y="Cross-validation error",
         title="ADMIXTURE CV error by K") +
    theme_bw(base_size=13)
ggsave("3_plots/cv_error_plot.png", width=7, height=5, dpi=150)
Download:
Structure plot decoded
📈pivot_longer + geom_bar(stat="identity")Convert the wide Q matrix (one column per K) to long format, then stack bars. Each individual is one vertical bar; colours show ancestry proportions. The key is ordering individuals by population and region so the plot shows clean transitions between groups.
👥facet_grid(~ Region, scales="free_x", space="free_x")Creates separate panels per continent, with panel widths proportional to the number of individuals in each region. This prevents large regions (Africa) from visually dominating the plot.
🌠
Expected HGDP K=7 result: Africa (orange), Europe (blue), Middle East (green), Central/South Asia (yellow), East Asia (red), Oceania (purple), Americas (pink). Each continental group should appear as a near-solid colour, with admixed individuals (e.g. Central Asian populations) showing multi-colour bars. This is the “canonical” ADMIXTURE result first shown by Alexander et al. 2009.
VCF Filtering & Pop Stats PSMC & Demographic History