FST, nucleotide diversity (π), Tajima’s D, iHS/XP-EHH selection scans — every formula unboxed with live interactive spreadsheets. Complete vcftools + PLINK + R pipeline on Rice 3000 Genomes.

Population genetics statistics answer: "How are allele frequencies distributed across populations, and what evolutionary forces shaped them?" Unlike variant calling, these statistics are computed across many individuals simultaneously and summarised in sliding windows.
| Statistic | Measures | Range | Biological question answered |
|---|---|---|---|
| π (pi) | Mean pairwise nucleotide differences per site | 0-1 | How diverse is this population? |
| FST | Fraction of total variance between populations | 0-1 | How different are two populations genetically? |
| Tajima's D | Difference between two estimators of theta | ~-3 to +3 | Is this locus evolving neutrally or under selection? |
| iHS | Extended haplotype homozygosity score | standardised Z | Is there a recent selective sweep in this population? |
| XP-EHH | Cross-population extended haplotype homozygosity | standardised Z | Which population shows stronger positive selection? |
| AMOVA Phi_ST | Hierarchical variance partitioning | 0-1 | What fraction of diversity is between vs within groups? |
We use a subset of the Rice 3000 Genomes Project (3K-RG) comparing Oryza sativa subspecies indica (IND) vs japonica (JAP). These two subspecies diverged ~400,000 years ago and show genome-wide mean FST ~0.35-0.45 -- among the highest in any crop species.
| Resource | Description | Access |
|---|---|---|
| Rice 3K VCF | 3024 rice accessions, chromosome-level SNP VCFs | PRJEB6180 at EBI / NCBI |
| Population labels | IND / JAP / admixed classification | Wang et al. 2018 Nature Genetics Supplementary |
| Reference | IRGSP-1.0 (Nipponbare) | rapdb.dna.affrc.go.jp |
Before computing any population genetics statistic, you need allele frequencies per population. The Hardy-Weinberg equilibrium (HWE) test tells you whether your population is randomly mating at a locus. Deviations suggest inbreeding, selection, population structure, or genotyping error.
Nucleotide diversity (π) is the average number of nucleotide differences per site between two randomly drawn sequences. For a biallelic SNP, π = 2pq. For a window, it is the mean 2pq across all sites (including monomorphic sites which contribute 0).
Weir & Cockerham (1984) FST is the gold standard for measuring genetic differentiation. It ranges 0 (identical allele frequencies) to 1 (fixed for different alleles). The W-C estimator is unbiased with unequal sample sizes, which is why vcftools --weir-fst-pop is preferred over naive approaches.
Tajima's D (1989) compares two estimators of theta (= 4*Ne*mu): pi (sensitive to intermediate-frequency variants) and Watterson's theta_W (sensitive to all segregating sites equally). Selection or demographic events distort the site-frequency spectrum, causing these two estimates to diverge.
Haplotype-based statistics detect incomplete selective sweeps where a beneficial allele carries a long, unusually homozygous haplotype. They are more powerful than FST and Tajima's D for recent sweeps (1,000--25,000 generations ago).
| Statistic | Measures | Use when | Tool |
|---|---|---|---|
| iHS | EHH of derived vs ancestral allele within one population | Sweep within a single population; |iHS| > 2 = top ~5% | selscan, hapbin |
| XP-EHH | EHH ratio between two populations for the same allele | Comparing two populations; positive = sweep in pop1, negative = sweep in pop2 | selscan, rehh (R) |
| nSL | Number of segregating sites by length (no genetic map needed) | Non-model organisms without a genetic map | selscan |
| H12 | Haplotype frequency spectrum over 200-SNP windows | Both hard sweeps and soft sweeps (multiple haplotypes) | H12 (Garud et al.) |
## Requires: selscan v1.3+, phased VCF (see haplotype-phasing tutorial)
## Population files: one sample ID per line
## --- Extract per-population phased VCFs ---
for POP in IND JAP; do
bcftools view --samples-file ${POP}_samples.txt \
--phased --min-ac 1:minor chr04_phased.vcf.gz | \
bcftools filter -e 'AF<0.01 || AF>0.99' | \
bgzip -c > chr04_${POP}_phased.vcf.gz
tabix -p vcf chr04_${POP}_phased.vcf.gz
done
## --- iHS within each population ---
## selscan needs a genetic map file: SNP_ID Chr pos_bp pos_cM
selscan --ihs \
--vcf chr04_IND_phased.vcf.gz \
--map chr04_genetic_map.txt \
--out chr04_IND_ihs \
--threads 8
## Normalise by allele frequency bins (REQUIRED before interpretation)
norm --ihs --bins 100 --files chr04_IND_ihs.ihs.out
## Output: .ihs.out.100bins.norm -- |iHS| > 2 = candidate sweep
## --- XP-EHH between populations ---
selscan --xpehh \
--vcf chr04_IND_phased.vcf.gz \
--vcf-ref chr04_JAP_phased.vcf.gz \
--map chr04_genetic_map.txt \
--out chr04_IND_vs_JAP_xpehh \
--threads 8
norm --xpehh --bins 100 --files chr04_IND_vs_JAP_xpehh.xpehh.out
## Positive XP-EHH = selection in IND; negative = selection in JAP
## --- Alternative: rehh R package (no command-line selscan needed) ---
## library(rehh)
## hh <- data2haplohh(hap_file="chr04_IND.hap", map_file="chr04.map")
## res_ihs <- scan_hh(hh)
## ihs <- ihh2ihs(res_ihs)
## freqbinplot(ihs) ## diagnostic plot
## ================================================================
## Full population genetics pipeline using vcftools + PLINK2
## Dataset: Rice 3K chr04, indica vs japonica
## ================================================================
VCF="rice3k_chr04.vcf.gz"
## Step A: Filter VCF
vcftools --gzvcf ${VCF} \
--remove-indels --maf 0.05 --max-missing 0.8 \
--min-alleles 2 --max-alleles 2 --minDP 5 \
--recode --stdout | bgzip -c > rice_filtered.vcf.gz
tabix -p vcf rice_filtered.vcf.gz
## Step B: Nucleotide diversity (50 kb windows, 10 kb step)
for POP in ind jap; do
vcftools --gzvcf rice_filtered.vcf.gz \
--keep ${POP}_pops.txt \
--window-pi 50000 --window-pi-step 10000 \
--out ${POP}_pi
done
## Step C: Tajima's D (50 kb non-overlapping windows)
for POP in ind jap; do
vcftools --gzvcf rice_filtered.vcf.gz \
--keep ${POP}_pops.txt \
--TajimaD 50000 \
--out ${POP}_tajima
done
## Only interpret windows with N_SNPS >= 5
## Step D: Weir-Cockerham Fst
vcftools --gzvcf rice_filtered.vcf.gz \
--weir-fst-pop ind_pops.txt \
--weir-fst-pop jap_pops.txt \
--fst-window-size 50000 --fst-window-step 10000 \
--out ind_vs_jap_fst
## Genome-wide mean and weighted Fst printed to stdout
## Step E: LD pruning + PCA
plink2 --vcf rice_filtered.vcf.gz \
--allow-extra-chr --double-id \
--maf 0.05 --geno 0.05 \
--indep-pairwise 50 10 0.2 \
--out rice_ldpruned
plink2 --vcf rice_filtered.vcf.gz \
--allow-extra-chr --double-id \
--extract rice_ldpruned.prune.in \
--pca 10 --out rice_pca
## rice_pca.eigenvec = PC scores; rice_pca.eigenval = variance explained
## Step F: Admixture (population structure, K=2 to K=6)
## Requires ADMIXTURE v1.3
plink2 --vcf rice_filtered.vcf.gz \
--allow-extra-chr --double-id \
--extract rice_ldpruned.prune.in \
--make-bed --out rice_pruned
for K in 2 3 4 5 6; do
admixture --cv rice_pruned.bed ${K} | tee log${K}.out
done
## CV error: grep "CV error" log*.out | sort -k3,3n ==> choose K with min CV
library(ggplot2); library(dplyr); library(tidyr)
## 1. Fst Manhattan plot
fst <- read.table("ind_vs_jap_fst.windowed.weir.fst", header=TRUE, sep="\t")
fst$MEAN_FST[fst$MEAN_FST < 0] <- 0 ## set negative to 0 for display only
fst99 <- quantile(fst$MEAN_FST, 0.99, na.rm=TRUE)
ggplot(fst, aes(BIN_START/1e6, MEAN_FST, colour=MEAN_FST >= fst99)) +
geom_point(size=0.5, alpha=0.7) +
geom_hline(yintercept=fst99, linetype="dashed", colour="red") +
scale_colour_manual(values=c("FALSE"="#6baed6","TRUE"="#e31a1c"), guide=FALSE) +
facet_grid(.~CHROM, scales="free_x", space="free_x") +
labs(x="Position (Mb)", y="Fst", title="Indica vs Japonica -- 50 kb window Fst") +
theme_minimal(base_size=10)
ggsave("fst_manhattan.png", width=14, height=4, dpi=200)
## 2. Pi comparison
pi_ind <- read.table("ind_pi.windowed.pi", header=TRUE); pi_ind$POP <- "Indica"
pi_jap <- read.table("jap_pi.windowed.pi", header=TRUE); pi_jap$POP <- "Japonica"
pi_all <- bind_rows(pi_ind, pi_jap)
ggplot(pi_all, aes(CHROM, PI, fill=POP)) +
geom_boxplot(outlier.size=0.3, outlier.alpha=0.3) +
scale_fill_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
labs(x="Chromosome", y="pi (per site)", title="Nucleotide diversity -- 50 kb windows") +
theme_bw(base_size=11)
ggsave("pi_comparison.png", width=10, height=5, dpi=200)
## 3. Tajima D genome-wide
td <- read.table("ind_tajima.Tajima.D", header=TRUE)
td <- td[td$N_SNPS >= 5, ]
ggplot(td, aes(TajimaD)) +
geom_histogram(binwidth=0.1, fill="#4dac26", colour="white") +
geom_vline(xintercept=c(-2,2), linetype="dotted", colour="red") +
labs(x="Tajima D", y="Count (windows)", title="Tajima D distribution -- Indica",
subtitle="Red lines = +-2 (approx 5% tails)") +
theme_bw(base_size=11)
ggsave("tajimad_distribution.png", width=6, height=4, dpi=200)
## 4. PCA
pca_vecs <- read.table("rice_pca.eigenvec", header=FALSE)
pca_vals <- read.table("rice_pca.eigenval", header=FALSE)
ind_ids <- scan("ind_pops.txt", what="", quiet=TRUE)
pca_vecs$POP <- ifelse(pca_vecs$V1 %in% ind_ids, "Indica", "Japonica")
pct <- round(pca_vals$V1/sum(pca_vals$V1)*100,1)
ggplot(pca_vecs, aes(V3, V4, colour=POP)) +
geom_point(size=2, alpha=0.7) +
scale_colour_manual(values=c("Indica"="#fc8d62","Japonica"="#8da0cb")) +
labs(x=paste0("PC1 (",pct[1],"%)"), y=paste0("PC2 (",pct[2],"%)"),
colour="Subspecies", title="PCA -- population structure") +
theme_bw(base_size=12)
ggsave("pca_structure.png", width=7, height=6, dpi=200)
## 5. ADMIXTURE barplot (K=3)
Q3 <- read.table("rice_pruned.3.Q")
pop_labels <- read.table("ind_pops.txt")$V1
Q3$POP <- ifelse(rownames(Q3) %in% pop_labels, "Indica", "Japonica")
Q3_long <- Q3 %>% mutate(ID=row_number()) %>%
pivot_longer(cols=starts_with("V"), names_to="Ancestry", values_to="Proportion")
ggplot(Q3_long, aes(ID, Proportion, fill=Ancestry)) +
geom_bar(stat="identity", width=1) +
facet_grid(.~Q3$POP[Q3_long$ID], scales="free_x", space="free_x") +
scale_fill_brewer(palette="Set1") +
labs(x="Individual", y="Ancestry proportion", title="ADMIXTURE K=3") +
theme_minimal(base_size=10) +
theme(axis.text.x=element_blank(), strip.text=element_text(face="bold"))
ggsave("admixture_K3.png", width=12, height=3, dpi=200)
| Problem | Symptom | Fix |
|---|---|---|
| Unequal sample sizes inflate Fst | Fst always higher for the smaller population | Use Weir-Cockerham (vcftools --weir-fst-pop). W-C is sample-size corrected. Or downsample with vcftools --max-indv. |
| Missing data biases pi and Tajima D | Windows with many missing genotypes show artificially low pi | Filter with --max-missing 0.9 before computing stats. Exclude windows with fewer than 5 informative SNPs. |
| Related individuals inflate F_IS | Excess homozygosity even in outcrossing species | Remove relatives (KING kinship > 0.125 = 2nd degree) with plink2 --king-cutoff before all analyses. |
| iHS inflated in low-recombination regions | Centromeres show iHS > 3 everywhere | Always normalise iHS in allele-frequency bins. Exclude centromeric regions. Use nSL which is more robust to recombination rate variation. |
| PCA driven by one chromosome arm | Stripe pattern in PCA; a few extreme outliers | LD pruning (--indep-pairwise 50 10 0.2) usually fixes this. Large chromosomal inversions can dominate PC1 -- exclude the inverted region from PCA input. |
| Tajima D significant genome-wide | Nearly all windows show D < -1.5 | Likely population expansion, not genome-wide selection. Run PSMC or SMC++ to infer Ne history and simulate expected D distribution under your demographic model. |