Population Genetics Statistics

FST, nucleotide diversity (π), Tajima’s D, iHS/XP-EHH selection scans — every formula unboxed with live interactive spreadsheets. Complete vcftools + PLINK + R pipeline on Rice 3000 Genomes.

~150 min Intermediate-Advanced Live spreadsheets vcftools / PLINK / R
Byte

Core Concepts

Population genetics statistics answer: "How are allele frequencies distributed across populations, and what evolutionary forces shaped them?" Unlike variant calling, these statistics are computed across many individuals simultaneously and summarised in sliding windows.

StatisticMeasuresRangeBiological question answered
π (pi)Mean pairwise nucleotide differences per site0-1How diverse is this population?
FSTFraction of total variance between populations0-1How different are two populations genetically?
Tajima's DDifference between two estimators of theta~-3 to +3Is this locus evolving neutrally or under selection?
iHSExtended haplotype homozygosity scorestandardised ZIs there a recent selective sweep in this population?
XP-EHHCross-population extended haplotype homozygositystandardised ZWhich population shows stronger positive selection?
AMOVA Phi_STHierarchical variance partitioning0-1What fraction of diversity is between vs within groups?
Byte
Always window-average, never interpret single-SNP values. A single SNP with FST=1.0 is noise. A 50 kb window with mean FST=0.8 is a true signal of differentiation. Always use sliding windows (typically 50 kb or 100-SNP) before interpreting any of these statistics. vcftools computes windowed statistics natively with --window-pi, --TajimaD, --fst-window-size.

Dataset

We use a subset of the Rice 3000 Genomes Project (3K-RG) comparing Oryza sativa subspecies indica (IND) vs japonica (JAP). These two subspecies diverged ~400,000 years ago and show genome-wide mean FST ~0.35-0.45 -- among the highest in any crop species.

ResourceDescriptionAccess
Rice 3K VCF3024 rice accessions, chromosome-level SNP VCFsPRJEB6180 at EBI / NCBI
Population labelsIND / JAP / admixed classificationWang et al. 2018 Nature Genetics Supplementary
ReferenceIRGSP-1.0 (Nipponbare)rapdb.dna.affrc.go.jp

Step 1 — Allele Frequencies & Hardy-Weinberg Equilibrium

Before computing any population genetics statistic, you need allele frequencies per population. The Hardy-Weinberg equilibrium (HWE) test tells you whether your population is randomly mating at a locus. Deviations suggest inbreeding, selection, population structure, or genotyping error.

Hardy-Weinberg Equilibrium + Inbreeding Coefficient
p = (2*N_AA + N_AB) / (2*N_total)
Expected heterozygosity (HWE) = 2pq
F_IS = 1 - (observed_Het / expected_Het)
pFrequency of allele A = (2 x N_AA + N_AB) / (2 x N_total) q = 1-pFrequency of allele B F_IS = 0Random mating (HWE). Rice subspecies are highly selfing so F_IS is typically 0.85-0.95. F_IS < 0Excess heterozygotes -- outbreeding advantage or genotyping error (false heterozygous calls)

Step 2 — Nucleotide Diversity (π)

Nucleotide diversity (π) is the average number of nucleotide differences per site between two randomly drawn sequences. For a biallelic SNP, π = 2pq. For a window, it is the mean 2pq across all sites (including monomorphic sites which contribute 0).

Nucleotide Diversity (per-site and window)
Per biallelic site: pi = 2pq
Window pi = sum(2pq over all S SNPs) / window_length_bp
p, qAllele frequencies at the SNP 2pqExpected heterozygosity -- probability two randomly drawn alleles differ Window lengthDivide by window SIZE not number of SNPs. Windows with fewer SNPs are simply less diverse, not missing data.

Step 3 — FST: Population Differentiation

Weir & Cockerham (1984) FST is the gold standard for measuring genetic differentiation. It ranges 0 (identical allele frequencies) to 1 (fixed for different alleles). The W-C estimator is unbiased with unequal sample sizes, which is why vcftools --weir-fst-pop is preferred over naive approaches.

F_ST (Wright / simplified W-C formula)
H_T = 2 * p_bar * (1 - p_bar) [total heterozygosity if merged]
H_S = (H_pop1 + H_pop2) / 2 [mean within-population heterozygosity]
F_ST = (H_T - H_S) / H_T
p_barWeighted mean allele frequency across populations = (N1*p1 + N2*p2) / (N1 + N2) F_ST 0-0.05Little differentiation F_ST 0.05-0.15Moderate differentiation F_ST 0.15-0.25Great differentiation F_ST > 0.25Very great differentiation (Wright 1978 scale)
💡
Negative FST values are normal and expected. The Weir-Cockerham estimator can produce small negative values (-0.02 to -0.05). This happens when within-population heterozygosity slightly exceeds the expected total heterozygosity due to sampling variance. vcftools sets these to 0 in some outputs, but keep them in window averages to avoid positive bias. Never set negative values to 0 before averaging.

Step 4 — Tajima’s D: The Neutrality Test

Tajima's D (1989) compares two estimators of theta (= 4*Ne*mu): pi (sensitive to intermediate-frequency variants) and Watterson's theta_W (sensitive to all segregating sites equally). Selection or demographic events distort the site-frequency spectrum, causing these two estimates to diverge.

Tajima's D Formula
theta_W = S / a1 where a1 = sum(1/i) for i=1 to n-1
D = (pi - theta_W) / sqrt(Var(pi - theta_W))
SNumber of segregating (polymorphic) sites in the window nNumber of haploid sequences (= 2 x diploid individuals) a1Harmonic number = 1 + 1/2 + 1/3 + ... + 1/(n-1). Normalises S for sample size. D = 0Neutral evolution (theta_W = pi) D < -1.5Excess rare variants: selective sweep OR population expansion after bottleneck D > +1.5Excess intermediate variants: balancing selection OR population subdivision
⚠️
Tajima's D is NOT a selection-only test. A negative D could mean a selective sweep OR a population expansion after a bottleneck (both create excess rare variants). Always combine D with FST, iHS/XP-EHH, and demographic modelling (e.g. SMC++ or PSMC for Ne history). The correct interpretation: compute a genome-wide null distribution of D under your demographic model, then flag outlier windows as candidates.

Step 5 — Selection Scans: iHS and XP-EHH

Haplotype-based statistics detect incomplete selective sweeps where a beneficial allele carries a long, unusually homozygous haplotype. They are more powerful than FST and Tajima's D for recent sweeps (1,000--25,000 generations ago).

StatisticMeasuresUse whenTool
iHSEHH of derived vs ancestral allele within one populationSweep within a single population; |iHS| > 2 = top ~5%selscan, hapbin
XP-EHHEHH ratio between two populations for the same alleleComparing two populations; positive = sweep in pop1, negative = sweep in pop2selscan, rehh (R)
nSLNumber of segregating sites by length (no genetic map needed)Non-model organisms without a genetic mapselscan
H12Haplotype frequency spectrum over 200-SNP windowsBoth hard sweeps and soft sweeps (multiple haplotypes)H12 (Garud et al.)
Bash -- selscan iHS and XP-EHH pipeline (phased VCF required)
## Requires: selscan v1.3+, phased VCF (see haplotype-phasing tutorial)
## Population files: one sample ID per line

## --- Extract per-population phased VCFs ---
for POP in IND JAP; do
  bcftools view --samples-file ${POP}_samples.txt \
    --phased --min-ac 1:minor chr04_phased.vcf.gz | \
    bcftools filter -e 'AF<0.01 || AF>0.99' | \
    bgzip -c > chr04_${POP}_phased.vcf.gz
  tabix -p vcf chr04_${POP}_phased.vcf.gz
done

## --- iHS within each population ---
## selscan needs a genetic map file: SNP_ID Chr pos_bp pos_cM
selscan --ihs \
  --vcf chr04_IND_phased.vcf.gz \
  --map chr04_genetic_map.txt \
  --out chr04_IND_ihs \
  --threads 8

## Normalise by allele frequency bins (REQUIRED before interpretation)
norm --ihs --bins 100 --files chr04_IND_ihs.ihs.out
## Output: .ihs.out.100bins.norm -- |iHS| > 2 = candidate sweep

## --- XP-EHH between populations ---
selscan --xpehh \
  --vcf chr04_IND_phased.vcf.gz \
  --vcf-ref chr04_JAP_phased.vcf.gz \
  --map chr04_genetic_map.txt \
  --out chr04_IND_vs_JAP_xpehh \
  --threads 8

norm --xpehh --bins 100 --files chr04_IND_vs_JAP_xpehh.xpehh.out
## Positive XP-EHH = selection in IND; negative = selection in JAP

## --- Alternative: rehh R package (no command-line selscan needed) ---
## library(rehh)
## hh <- data2haplohh(hap_file="chr04_IND.hap", map_file="chr04.map")
## res_ihs <- scan_hh(hh)
## ihs <- ihh2ihs(res_ihs)
## freqbinplot(ihs)  ## diagnostic plot
Download:

Step 6 — Full vcftools Pipeline

Bash -- Complete population genetics pipeline
## ================================================================
## Full population genetics pipeline using vcftools + PLINK2
## Dataset: Rice 3K chr04, indica vs japonica
## ================================================================

VCF="rice3k_chr04.vcf.gz"

## Step A: Filter VCF
vcftools --gzvcf ${VCF} \
  --remove-indels --maf 0.05 --max-missing 0.8 \
  --min-alleles 2 --max-alleles 2 --minDP 5 \
  --recode --stdout | bgzip -c > rice_filtered.vcf.gz
tabix -p vcf rice_filtered.vcf.gz

## Step B: Nucleotide diversity (50 kb windows, 10 kb step)
for POP in ind jap; do
  vcftools --gzvcf rice_filtered.vcf.gz \
    --keep ${POP}_pops.txt \
    --window-pi 50000 --window-pi-step 10000 \
    --out ${POP}_pi
done

## Step C: Tajima's D (50 kb non-overlapping windows)
for POP in ind jap; do
  vcftools --gzvcf rice_filtered.vcf.gz \
    --keep ${POP}_pops.txt \
    --TajimaD 50000 \
    --out ${POP}_tajima
done
## Only interpret windows with N_SNPS >= 5

## Step D: Weir-Cockerham Fst
vcftools --gzvcf rice_filtered.vcf.gz \
  --weir-fst-pop ind_pops.txt \
  --weir-fst-pop jap_pops.txt \
  --fst-window-size 50000 --fst-window-step 10000 \
  --out ind_vs_jap_fst
## Genome-wide mean and weighted Fst printed to stdout

## Step E: LD pruning + PCA
plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --maf 0.05 --geno 0.05 \
  --indep-pairwise 50 10 0.2 \
  --out rice_ldpruned

plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --extract rice_ldpruned.prune.in \
  --pca 10 --out rice_pca
## rice_pca.eigenvec = PC scores; rice_pca.eigenval = variance explained

## Step F: Admixture (population structure, K=2 to K=6)
## Requires ADMIXTURE v1.3
plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --extract rice_ldpruned.prune.in \
  --make-bed --out rice_pruned

for K in 2 3 4 5 6; do
  admixture --cv rice_pruned.bed ${K} | tee log${K}.out
done
## CV error: grep "CV error" log*.out | sort -k3,3n  ==> choose K with min CV
Download:

Step 7 — R Visualisation

R -- Fst Manhattan, pi comparison, Tajima D, PCA, ADMIXTURE barplot
library(ggplot2); library(dplyr); library(tidyr)

## 1. Fst Manhattan plot
fst <- read.table("ind_vs_jap_fst.windowed.weir.fst", header=TRUE, sep="\t")
fst$MEAN_FST[fst$MEAN_FST < 0] <- 0  ## set negative to 0 for display only
fst99 <- quantile(fst$MEAN_FST, 0.99, na.rm=TRUE)
ggplot(fst, aes(BIN_START/1e6, MEAN_FST, colour=MEAN_FST >= fst99)) +
  geom_point(size=0.5, alpha=0.7) +
  geom_hline(yintercept=fst99, linetype="dashed", colour="red") +
  scale_colour_manual(values=c("FALSE"="#6baed6","TRUE"="#e31a1c"), guide=FALSE) +
  facet_grid(.~CHROM, scales="free_x", space="free_x") +
  labs(x="Position (Mb)", y="Fst", title="Indica vs Japonica -- 50 kb window Fst") +
  theme_minimal(base_size=10)
ggsave("fst_manhattan.png", width=14, height=4, dpi=200)

## 2. Pi comparison
pi_ind <- read.table("ind_pi.windowed.pi", header=TRUE); pi_ind$POP <- "Indica"
pi_jap <- read.table("jap_pi.windowed.pi", header=TRUE); pi_jap$POP <- "Japonica"
pi_all <- bind_rows(pi_ind, pi_jap)
ggplot(pi_all, aes(CHROM, PI, fill=POP)) +
  geom_boxplot(outlier.size=0.3, outlier.alpha=0.3) +
  scale_fill_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
  labs(x="Chromosome", y="pi (per site)", title="Nucleotide diversity -- 50 kb windows") +
  theme_bw(base_size=11)
ggsave("pi_comparison.png", width=10, height=5, dpi=200)

## 3. Tajima D genome-wide
td <- read.table("ind_tajima.Tajima.D", header=TRUE)
td <- td[td$N_SNPS >= 5, ]
ggplot(td, aes(TajimaD)) +
  geom_histogram(binwidth=0.1, fill="#4dac26", colour="white") +
  geom_vline(xintercept=c(-2,2), linetype="dotted", colour="red") +
  labs(x="Tajima D", y="Count (windows)", title="Tajima D distribution -- Indica",
       subtitle="Red lines = +-2 (approx 5% tails)") +
  theme_bw(base_size=11)
ggsave("tajimad_distribution.png", width=6, height=4, dpi=200)

## 4. PCA
pca_vecs <- read.table("rice_pca.eigenvec", header=FALSE)
pca_vals <- read.table("rice_pca.eigenval", header=FALSE)
ind_ids  <- scan("ind_pops.txt", what="", quiet=TRUE)
pca_vecs$POP <- ifelse(pca_vecs$V1 %in% ind_ids, "Indica", "Japonica")
pct  <- round(pca_vals$V1/sum(pca_vals$V1)*100,1)
ggplot(pca_vecs, aes(V3, V4, colour=POP)) +
  geom_point(size=2, alpha=0.7) +
  scale_colour_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
  labs(x=paste0("PC1 (",pct[1],"%)"), y=paste0("PC2 (",pct[2],"%)"),
       colour="Subspecies", title="PCA -- population structure") +
  theme_bw(base_size=12)
ggsave("pca_structure.png", width=7, height=6, dpi=200)

## 5. ADMIXTURE barplot (K=3)
Q3 <- read.table("rice_pruned.3.Q")
pop_labels <- read.table("ind_pops.txt")$V1
Q3$POP <- ifelse(rownames(Q3) %in% pop_labels, "Indica", "Japonica")
Q3_long <- Q3 %>% mutate(ID=row_number()) %>%
  pivot_longer(cols=starts_with("V"), names_to="Ancestry", values_to="Proportion")
ggplot(Q3_long, aes(ID, Proportion, fill=Ancestry)) +
  geom_bar(stat="identity", width=1) +
  facet_grid(.~Q3$POP[Q3_long$ID], scales="free_x", space="free_x") +
  scale_fill_brewer(palette="Set1") +
  labs(x="Individual", y="Ancestry proportion", title="ADMIXTURE K=3") +
  theme_minimal(base_size=10) +
  theme(axis.text.x=element_blank(), strip.text=element_text(face="bold"))
ggsave("admixture_K3.png", width=12, height=3, dpi=200)
Download:

Common Pitfalls

ProblemSymptomFix
Unequal sample sizes inflate FstFst always higher for the smaller populationUse Weir-Cockerham (vcftools --weir-fst-pop). W-C is sample-size corrected. Or downsample with vcftools --max-indv.
Missing data biases pi and Tajima DWindows with many missing genotypes show artificially low piFilter with --max-missing 0.9 before computing stats. Exclude windows with fewer than 5 informative SNPs.
Related individuals inflate F_ISExcess homozygosity even in outcrossing speciesRemove relatives (KING kinship > 0.125 = 2nd degree) with plink2 --king-cutoff before all analyses.
iHS inflated in low-recombination regionsCentromeres show iHS > 3 everywhereAlways normalise iHS in allele-frequency bins. Exclude centromeric regions. Use nSL which is more robust to recombination rate variation.
PCA driven by one chromosome armStripe pattern in PCA; a few extreme outliersLD pruning (--indep-pairwise 50 10 0.2) usually fixes this. Large chromosomal inversions can dominate PC1 -- exclude the inverted region from PCA input.
Tajima D significant genome-wideNearly all windows show D < -1.5Likely population expansion, not genome-wide selection. Run PSMC or SMC++ to infer Ne history and simulate expected D distribution under your demographic model.