Population Genetics: Fst, Pi and Tajima D

Core Concepts

Population genetics statistics answer: "How are allele frequencies distributed across populations, and what evolutionary forces shaped them?" Unlike variant calling, these statistics are computed across many individuals simultaneously and summarised in sliding windows.

Statistic	Measures	Range	Biological question answered
π (pi)	Mean pairwise nucleotide differences per site	0-1	How diverse is this population?
F_ST	Fraction of total variance between populations	0-1	How different are two populations genetically?
Tajima's D	Difference between two estimators of theta	~-3 to +3	Is this locus evolving neutrally or under selection?
iHS	Extended haplotype homozygosity score	standardised Z	Is there a recent selective sweep in this population?
XP-EHH	Cross-population extended haplotype homozygosity	standardised Z	Which population shows stronger positive selection?
AMOVA Phi_ST	Hierarchical variance partitioning	0-1	What fraction of diversity is between vs within groups?

Always window-average, never interpret single-SNP values. A single SNP with F_ST=1.0 is noise. A 50 kb window with mean F_ST=0.8 is a true signal of differentiation. Always use sliding windows (typically 50 kb or 100-SNP) before interpreting any of these statistics. vcftools computes windowed statistics natively with --window-pi, --TajimaD, --fst-window-size.

Dataset

We use a subset of the Rice 3000 Genomes Project (3K-RG) comparing Oryza sativa subspecies indica (IND) vs japonica (JAP). These two subspecies diverged ~400,000 years ago and show genome-wide mean F_ST ~0.35-0.45 -- among the highest in any crop species.

Resource	Description	Access
Rice 3K VCF	3024 rice accessions, chromosome-level SNP VCFs	PRJEB6180 at EBI / NCBI
Population labels	IND / JAP / admixed classification	Wang et al. 2018 Nature Genetics Supplementary
Reference	IRGSP-1.0 (Nipponbare)	rapdb.dna.affrc.go.jp

Step 1 — Allele Frequencies & Hardy-Weinberg Equilibrium

Before computing any population genetics statistic, you need allele frequencies per population. The Hardy-Weinberg equilibrium (HWE) test tells you whether your population is randomly mating at a locus. Deviations suggest inbreeding, selection, population structure, or genotyping error.

Hardy-Weinberg Equilibrium + Inbreeding Coefficient

p = (2*N_AA + N_AB) / (2*N_total)

Expected heterozygosity (HWE) = 2pq

F_IS = 1 - (observed_Het / expected_Het)

pFrequency of allele A = (2 x N_AA + N_AB) / (2 x N_total) q = 1-pFrequency of allele B F_IS = 0Random mating (HWE). Rice subspecies are highly selfing so F_IS is typically 0.85-0.95. F_IS < 0Excess heterozygotes -- outbreeding advantage or genotyping error (false heterozygous calls)

Step 2 — Nucleotide Diversity (π)

Nucleotide diversity (π) is the average number of nucleotide differences per site between two randomly drawn sequences. For a biallelic SNP, π = 2pq. For a window, it is the mean 2pq across all sites (including monomorphic sites which contribute 0).

Nucleotide Diversity (per-site and window)

Per biallelic site: pi = 2pq

Window pi = sum(2pq over all S SNPs) / window_length_bp

p, qAllele frequencies at the SNP 2pqExpected heterozygosity -- probability two randomly drawn alleles differ Window lengthDivide by window SIZE not number of SNPs. Windows with fewer SNPs are simply less diverse, not missing data.

Step 3 — F_ST: Population Differentiation

Weir & Cockerham (1984) F_ST is the gold standard for measuring genetic differentiation. It ranges 0 (identical allele frequencies) to 1 (fixed for different alleles). The W-C estimator is unbiased with unequal sample sizes, which is why vcftools --weir-fst-pop is preferred over naive approaches.

F_ST (Wright / simplified W-C formula)

H_T = 2 * p_bar * (1 - p_bar) [total heterozygosity if merged]

H_S = (H_pop1 + H_pop2) / 2 [mean within-population heterozygosity]

F_ST = (H_T - H_S) / H_T

p_barWeighted mean allele frequency across populations = (N1*p1 + N2*p2) / (N1 + N2) F_ST 0-0.05Little differentiation F_ST 0.05-0.15Moderate differentiation F_ST 0.15-0.25Great differentiation F_ST > 0.25Very great differentiation (Wright 1978 scale)

💡

Negative F_ST values are normal and expected. The Weir-Cockerham estimator can produce small negative values (-0.02 to -0.05). This happens when within-population heterozygosity slightly exceeds the expected total heterozygosity due to sampling variance. vcftools sets these to 0 in some outputs, but keep them in window averages to avoid positive bias. Never set negative values to 0 before averaging.

Step 4 — Tajima’s D: The Neutrality Test

Tajima's D (1989) compares two estimators of theta (= 4*Ne*mu): pi (sensitive to intermediate-frequency variants) and Watterson's theta_W (sensitive to all segregating sites equally). Selection or demographic events distort the site-frequency spectrum, causing these two estimates to diverge.

Tajima's D Formula

theta_W = S / a1 where a1 = sum(1/i) for i=1 to n-1

D = (pi - theta_W) / sqrt(Var(pi - theta_W))

SNumber of segregating (polymorphic) sites in the window nNumber of haploid sequences (= 2 x diploid individuals) a1Harmonic number = 1 + 1/2 + 1/3 + ... + 1/(n-1). Normalises S for sample size. D = 0Neutral evolution (theta_W = pi) D < -1.5Excess rare variants: selective sweep OR population expansion after bottleneck D > +1.5Excess intermediate variants: balancing selection OR population subdivision

⚠️

Tajima's D is NOT a selection-only test. A negative D could mean a selective sweep OR a population expansion after a bottleneck (both create excess rare variants). Always combine D with F_ST, iHS/XP-EHH, and demographic modelling (e.g. SMC++ or PSMC for Ne history). The correct interpretation: compute a genome-wide null distribution of D under your demographic model, then flag outlier windows as candidates.

Step 5 — Selection Scans: iHS and XP-EHH

Haplotype-based statistics detect incomplete selective sweeps where a beneficial allele carries a long, unusually homozygous haplotype. They are more powerful than F_ST and Tajima's D for recent sweeps (1,000--25,000 generations ago).

Statistic	Measures	Use when	Tool
iHS	EHH of derived vs ancestral allele within one population	Sweep within a single population; \|iHS\| > 2 = top ~5%	selscan, hapbin
XP-EHH	EHH ratio between two populations for the same allele	Comparing two populations; positive = sweep in pop1, negative = sweep in pop2	selscan, rehh (R)
nSL	Number of segregating sites by length (no genetic map needed)	Non-model organisms without a genetic map	selscan
H12	Haplotype frequency spectrum over 200-SNP windows	Both hard sweeps and soft sweeps (multiple haplotypes)	H12 (Garud et al.)

Bash -- selscan iHS and XP-EHH pipeline (phased VCF required)

## Requires: selscan v1.3+, phased VCF (see haplotype-phasing tutorial)
## Population files: one sample ID per line

## --- Extract per-population phased VCFs ---
for POP in IND JAP; do
  bcftools view --samples-file ${POP}_samples.txt \
    --phased --min-ac 1:minor chr04_phased.vcf.gz | \
    bcftools filter -e 'AF<0.01 || AF>0.99' | \
    bgzip -c > chr04_${POP}_phased.vcf.gz
  tabix -p vcf chr04_${POP}_phased.vcf.gz
done

## --- iHS within each population ---
## selscan needs a genetic map file: SNP_ID Chr pos_bp pos_cM
selscan --ihs \
  --vcf chr04_IND_phased.vcf.gz \
  --map chr04_genetic_map.txt \
  --out chr04_IND_ihs \
  --threads 8

## Normalise by allele frequency bins (REQUIRED before interpretation)
norm --ihs --bins 100 --files chr04_IND_ihs.ihs.out
## Output: .ihs.out.100bins.norm -- |iHS| > 2 = candidate sweep

## --- XP-EHH between populations ---
selscan --xpehh \
  --vcf chr04_IND_phased.vcf.gz \
  --vcf-ref chr04_JAP_phased.vcf.gz \
  --map chr04_genetic_map.txt \
  --out chr04_IND_vs_JAP_xpehh \
  --threads 8

norm --xpehh --bins 100 --files chr04_IND_vs_JAP_xpehh.xpehh.out
## Positive XP-EHH = selection in IND; negative = selection in JAP

## --- Alternative: rehh R package (no command-line selscan needed) ---
## library(rehh)
## hh <- data2haplohh(hap_file="chr04_IND.hap", map_file="chr04.map")
## res_ihs <- scan_hh(hh)
## ihs <- ihh2ihs(res_ihs)
## freqbinplot(ihs)  ## diagnostic plot

Download:

Step 6 — Full vcftools Pipeline

Bash -- Complete population genetics pipeline

## ================================================================
## Full population genetics pipeline using vcftools + PLINK2
## Dataset: Rice 3K chr04, indica vs japonica
## ================================================================

VCF="rice3k_chr04.vcf.gz"

## Step A: Filter VCF
vcftools --gzvcf ${VCF} \
  --remove-indels --maf 0.05 --max-missing 0.8 \
  --min-alleles 2 --max-alleles 2 --minDP 5 \
  --recode --stdout | bgzip -c > rice_filtered.vcf.gz
tabix -p vcf rice_filtered.vcf.gz

## Step B: Nucleotide diversity (50 kb windows, 10 kb step)
for POP in ind jap; do
  vcftools --gzvcf rice_filtered.vcf.gz \
    --keep ${POP}_pops.txt \
    --window-pi 50000 --window-pi-step 10000 \
    --out ${POP}_pi
done

## Step C: Tajima's D (50 kb non-overlapping windows)
for POP in ind jap; do
  vcftools --gzvcf rice_filtered.vcf.gz \
    --keep ${POP}_pops.txt \
    --TajimaD 50000 \
    --out ${POP}_tajima
done
## Only interpret windows with N_SNPS >= 5

## Step D: Weir-Cockerham Fst
vcftools --gzvcf rice_filtered.vcf.gz \
  --weir-fst-pop ind_pops.txt \
  --weir-fst-pop jap_pops.txt \
  --fst-window-size 50000 --fst-window-step 10000 \
  --out ind_vs_jap_fst
## Genome-wide mean and weighted Fst printed to stdout

## Step E: LD pruning + PCA
plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --maf 0.05 --geno 0.05 \
  --indep-pairwise 50 10 0.2 \
  --out rice_ldpruned

plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --extract rice_ldpruned.prune.in \
  --pca 10 --out rice_pca
## rice_pca.eigenvec = PC scores; rice_pca.eigenval = variance explained

## Step F: Admixture (population structure, K=2 to K=6)
## Requires ADMIXTURE v1.3
plink2 --vcf rice_filtered.vcf.gz \
  --allow-extra-chr --double-id \
  --extract rice_ldpruned.prune.in \
  --make-bed --out rice_pruned

for K in 2 3 4 5 6; do
  admixture --cv rice_pruned.bed ${K} | tee log${K}.out
done
## CV error: grep "CV error" log*.out | sort -k3,3n  ==> choose K with min CV

Download:

Step 7 — R Visualisation

R -- Fst Manhattan, pi comparison, Tajima D, PCA, ADMIXTURE barplot

library(ggplot2); library(dplyr); library(tidyr)

## 1. Fst Manhattan plot
fst <- read.table("ind_vs_jap_fst.windowed.weir.fst", header=TRUE, sep="\t")
fst$MEAN_FST[fst$MEAN_FST < 0] <- 0  ## set negative to 0 for display only
fst99 <- quantile(fst$MEAN_FST, 0.99, na.rm=TRUE)
ggplot(fst, aes(BIN_START/1e6, MEAN_FST, colour=MEAN_FST >= fst99)) +
  geom_point(size=0.5, alpha=0.7) +
  geom_hline(yintercept=fst99, linetype="dashed", colour="red") +
  scale_colour_manual(values=c("FALSE"="#6baed6","TRUE"="#e31a1c"), guide=FALSE) +
  facet_grid(.~CHROM, scales="free_x", space="free_x") +
  labs(x="Position (Mb)", y="Fst", title="Indica vs Japonica -- 50 kb window Fst") +
  theme_minimal(base_size=10)
ggsave("fst_manhattan.png", width=14, height=4, dpi=200)

## 2. Pi comparison
pi_ind <- read.table("ind_pi.windowed.pi", header=TRUE); pi_ind$POP <- "Indica"
pi_jap <- read.table("jap_pi.windowed.pi", header=TRUE); pi_jap$POP <- "Japonica"
pi_all <- bind_rows(pi_ind, pi_jap)
ggplot(pi_all, aes(CHROM, PI, fill=POP)) +
  geom_boxplot(outlier.size=0.3, outlier.alpha=0.3) +
  scale_fill_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
  labs(x="Chromosome", y="pi (per site)", title="Nucleotide diversity -- 50 kb windows") +
  theme_bw(base_size=11)
ggsave("pi_comparison.png", width=10, height=5, dpi=200)

## 3. Tajima D genome-wide
td <- read.table("ind_tajima.Tajima.D", header=TRUE)
td <- td[td$N_SNPS >= 5, ]
ggplot(td, aes(TajimaD)) +
  geom_histogram(binwidth=0.1, fill="#4dac26", colour="white") +
  geom_vline(xintercept=c(-2,2), linetype="dotted", colour="red") +
  labs(x="Tajima D", y="Count (windows)", title="Tajima D distribution -- Indica",
       subtitle="Red lines = +-2 (approx 5% tails)") +
  theme_bw(base_size=11)
ggsave("tajimad_distribution.png", width=6, height=4, dpi=200)

## 4. PCA
pca_vecs <- read.table("rice_pca.eigenvec", header=FALSE)
pca_vals <- read.table("rice_pca.eigenval", header=FALSE)
ind_ids  <- scan("ind_pops.txt", what="", quiet=TRUE)
pca_vecs$POP <- ifelse(pca_vecs$V1 %in% ind_ids, "Indica", "Japonica")
pct  <- round(pca_vals$V1/sum(pca_vals$V1)*100,1)
ggplot(pca_vecs, aes(V3, V4, colour=POP)) +
  geom_point(size=2, alpha=0.7) +
  scale_colour_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
  labs(x=paste0("PC1 (",pct[1],"%)"), y=paste0("PC2 (",pct[2],"%)"),
       colour="Subspecies", title="PCA -- population structure") +
  theme_bw(base_size=12)
ggsave("pca_structure.png", width=7, height=6, dpi=200)

## 5. ADMIXTURE barplot (K=3)
Q3 <- read.table("rice_pruned.3.Q")
pop_labels <- read.table("ind_pops.txt")$V1
Q3$POP <- ifelse(rownames(Q3) %in% pop_labels, "Indica", "Japonica")
Q3_long <- Q3 %>% mutate(ID=row_number()) %>%
  pivot_longer(cols=starts_with("V"), names_to="Ancestry", values_to="Proportion")
ggplot(Q3_long, aes(ID, Proportion, fill=Ancestry)) +
  geom_bar(stat="identity", width=1) +
  facet_grid(.~Q3$POP[Q3_long$ID], scales="free_x", space="free_x") +
  scale_fill_brewer(palette="Set1") +
  labs(x="Individual", y="Ancestry proportion", title="ADMIXTURE K=3") +
  theme_minimal(base_size=10) +
  theme(axis.text.x=element_blank(), strip.text=element_text(face="bold"))
ggsave("admixture_K3.png", width=12, height=3, dpi=200)

Download:

Common Pitfalls

Problem	Symptom	Fix
Unequal sample sizes inflate Fst	Fst always higher for the smaller population	Use Weir-Cockerham (vcftools --weir-fst-pop). W-C is sample-size corrected. Or downsample with vcftools --max-indv.
Missing data biases pi and Tajima D	Windows with many missing genotypes show artificially low pi	Filter with --max-missing 0.9 before computing stats. Exclude windows with fewer than 5 informative SNPs.
Related individuals inflate F_IS	Excess homozygosity even in outcrossing species	Remove relatives (KING kinship > 0.125 = 2nd degree) with plink2 --king-cutoff before all analyses.
iHS inflated in low-recombination regions	Centromeres show iHS > 3 everywhere	Always normalise iHS in allele-frequency bins. Exclude centromeric regions. Use nSL which is more robust to recombination rate variation.
PCA driven by one chromosome arm	Stripe pattern in PCA; a few extreme outliers	LD pruning (--indep-pairwise 50 10 0.2) usually fixes this. Large chromosomal inversions can dominate PC1 -- exclude the inverted region from PCA input.
Tajima D significant genome-wide	Nearly all windows show D < -1.5	Likely population expansion, not genome-wide selection. Run PSMC or SMC++ to infer Ne history and simulate expected D distribution under your demographic model.

All Tutorials Haplotype Phasing

Population Genetics Statistics

On this page

Core Concepts

Dataset

Step 1 — Allele Frequencies & Hardy-Weinberg Equilibrium

Step 2 — Nucleotide Diversity (π)

Step 3 — F_ST: Population Differentiation

Step 4 — Tajima’s D: The Neutrality Test

Step 5 — Selection Scans: iHS and XP-EHH

Step 6 — Full vcftools Pipeline

Step 7 — R Visualisation

Common Pitfalls

Population Genetics Statistics

On this page

Core Concepts

Dataset

Step 1 — Allele Frequencies & Hardy-Weinberg Equilibrium

Step 2 — Nucleotide Diversity (π)

Step 3 — FST: Population Differentiation

Step 4 — Tajima’s D: The Neutrality Test

Step 5 — Selection Scans: iHS and XP-EHH

Step 6 — Full vcftools Pipeline

Step 7 — R Visualisation

Common Pitfalls

Step 3 — F_ST: Population Differentiation