VCF Filtering & Population Statistics

Quality-filter multi-sample VCFs with VCFtools & bcftools · Compute Tajima’s D, π, Fst · LD pruning & PCA with PLINK 2.0

~90 min Intermediate Bash / R
Byte

Overview

Before any population genetic analysis, your VCF must be rigorously filtered. Raw variant calls from GATK or other callers contain genuine polymorphisms, sequencing errors, mapping artefacts, and low-confidence genotypes that will bias every downstream statistic. This tutorial covers the standard filtering and QC pipeline, followed by computing the core summary statistics (nucleotide diversity π, Tajima’s D, pairwise Fst) and a PCA to visualise population structure.

Byte
Key concepts:
  • MAF (Minor Allele Frequency): minimum frequency threshold below which sites are removed — singletons are often sequencing errors
  • Missingness: fraction of individuals lacking a genotype at a site — high missingness causes biased statistics
  • LD pruning: removes correlated SNPs before PCA/ADMIXTURE so that independent loci are analysed
  • Fst: fixation index — proportion of total genetic variance explained by between-population differences (0 = identical, 1 = completely fixed)
  • Tajima’s D: compares two estimators of θ (Watterson’s vs. pairwise) — negative = excess rare alleles (expansion/selection), positive = excess intermediate alleles (balancing selection/bottleneck)

Ideal Directory Structure

Keep raw data untouched and separate from filtered outputs at every step.

popgen_vcf_filtering/
popgen_vcf_filtering/ ├── 0_raw/ # downloaded VCF + reference (never modify) │ ├── 1001genomes_snp.vcf.gz │ └── TAIR10.fa ├── 1_qc/ # bcftools stats output │ └── bcftools_stats.txt ├── 2_filtered/ # VCFtools-filtered VCF │ ├── filtered.vcf.gz │ └── filtered.log ├── 3_diversity/ # per-window Tajima D, pi, Fst │ ├── tajimaD.Tajima.D │ ├── pi.windowed.pi │ └── fst.weir.fst ├── 4_pca/ # PLINK LD-pruned files + eigenvectors │ ├── plink_pruned.bed/bim/fam │ └── pca.eigenvec / pca.eigenval └── 5_plots/ # R-generated figures ├── pca_plot.png └── tajimad_plot.png

Dataset

The Arabidopsis thaliana 1001 Genomes Project sequenced 1,135 natural inbred lines from across the native Eurasian range at ~15x coverage. It is the most comprehensive plant population genomics dataset available and has been used to study local adaptation, selfing, and the genetic architecture of complex traits.

PropertyValue
OrganismArabidopsis thaliana
AccessionPRJNA273563 / 1001genomes.org v3.1
Sample size1,135 natural accessions
SNPs (raw)~10 million
ReferenceTAIR10 (A. thaliana Col-0)
Publication1001 Genomes Consortium, Cell 2016
Why this dataset? It is biallelic SNP-only, already hard-filtered by the consortium, making it an ideal starting point to learn downstream filtering. The accessions span six geographic groups (Africa, Central Europe, Iberia, Scandinavia, Western Europe, Relict) enabling clean multi-population Fst and PCA demonstrations.

Step 1 — Download & Inspect VCF

Bash — Download 1001 Genomes VCF
mkdir -p 0_raw
cd 0_raw

# Download the 1001 Genomes SNP VCF (all chromosomes, ~1.4 GB compressed)
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz.tbi

# Or download chromosome-by-chromosome for testing (Chr1 only, ~280 MB)
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/by_chromosome/GMI-MPI_1001-genomes_chr1.vcf.gz

# Quick inspection
module load bcftools/1.17
bcftools stats 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | head -50

# Count variants and samples
bcftools query -l 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | wc -l   # samples
bcftools view --no-header 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | wc -l   # variants

# View first few records
bcftools view -H 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | head -5

cd ..
Download:
Download and inspection commands decoded
📊bcftools statsGenerates a detailed QC report including: number of SNPs, indels, ts/tv ratio, per-sample depth, and allele frequency distribution. Pipe through grep ^SN to get just the summary numbers.
👥bcftools query -lLists all sample names in the VCF header — one per line. Piping to wc -l gives the sample count. Use this to verify the VCF contains the samples you expect.
📄.tbi index fileTabix index for random-access queries by genomic coordinate. Required for bcftools region queries (-r Chr1:1-1000000). Always download the .tbi alongside the VCF.gz.

Step 2 — bcftools QC & Stats

Bash — bcftools stats and multiallelic check
mkdir -p 1_qc
VCF="0_raw/1001genomes_snp-short-indel_only_ACGTN.vcf.gz"

# Full stats report
bcftools stats ${VCF} > 1_qc/bcftools_stats.txt

# Extract key summary lines
grep "^SN" 1_qc/bcftools_stats.txt

# Check for multiallelic sites (more than 2 alleles)
bcftools view --min-alleles 3 ${VCF} | bcftools stats | grep "^SN"

# Split multiallelic sites into biallelic (required by PLINK/ADMIXTURE)
bcftools norm \
    --multiallelics -snps \
    --output-type z \
    --output 1_qc/biallelic.vcf.gz \
    ${VCF}
bcftools index --tbi 1_qc/biallelic.vcf.gz

# Count SNPs only (no indels) after splitting
bcftools view --type snps 1_qc/biallelic.vcf.gz | bcftools stats | grep "^SN"

echo "QC complete. Check 1_qc/bcftools_stats.txt"
Download:
bcftools normalisation decoded
🔨bcftools norm --multiallelics -snpsSplits multiallelic SNP records (e.g. A→C,T at one position) into separate biallelic records (A→C and A→T). PLINK and ADMIXTURE require biallelic-only VCFs. The -snps flag processes only SNPs, leaving indels untouched.
📊ts/tv ratioTransition/transversion ratio. Expected ~2.0–2.1 for whole-genome SNPs. Values far from this (e.g. 1.5) indicate poor filtering or mapping artefacts. Check this in bcftools_stats.txt under TSTV.

Step 3 — VCFtools Filtering

Apply quality filters to remove low-confidence sites and samples:

Bash — VCFtools quality filtering
mkdir -p 2_filtered
module load vcftools/0.1.16

VCF="1_qc/biallelic.vcf.gz"
OUT="2_filtered/filtered"

vcftools \
    --gzvcf ${VCF} \
    --remove-indels \
    --maf 0.05 \
    --max-missing 0.9 \
    --minQ 30 \
    --min-meanDP 5 \
    --max-meanDP 50 \
    --minDP 3 \
    --hwe 0.001 \
    --recode \
    --recode-INFO-all \
    --out ${OUT}

# Compress and index the filtered VCF
bgzip ${OUT}.recode.vcf
tabix -p vcf ${OUT}.recode.vcf.gz
mv ${OUT}.recode.vcf.gz ${OUT}.vcf.gz
mv ${OUT}.recode.vcf.gz.tbi ${OUT}.vcf.gz.tbi

echo "Variants before filtering: $(bcftools view --no-header 0_raw/1001genomes_snp-short-indel_only_ACGTN.vcf.gz | wc -l)"
echo "Variants after filtering:  $(bcftools view --no-header ${OUT}.vcf.gz | wc -l)"
Download:
VCFtools filter flags decoded
📊--maf 0.05Remove sites where the minor allele frequency is <5%. Singletons and very rare variants are disproportionately affected by genotyping errors. For population structure analyses (PCA, ADMIXTURE) you want common variants that are informative across many individuals.
--max-missing 0.9Keep only sites where ≥90% of individuals have a genotype call. Sites with >10% missing data are unreliable for allele frequency estimates. Counterintuitively, 0.9 means “allow up to 10% missing” in VCFtools notation.
🎯--minQ 30Minimum QUAL score of 30 (Phred-scaled probability that the site is not a variant). Q30 = 0.1% chance the variant call is wrong. Sites below Q30 are likely sequencing or mapping artefacts.
📈--min-meanDP 5 --max-meanDP 50Filter by mean depth across samples. Sites with very low depth (<5x) have unreliable genotype calls. Sites with very high depth (>50x for most organisms) are often in repetitive/collapsed regions and produce false heterozygosity.
🧬--hwe 0.001Remove sites that deviate significantly from Hardy-Weinberg equilibrium (p<0.001). In diploid outbreeders, severe HWE deviation indicates genotyping errors (e.g. paralog calling). For selfing species like A. thaliana, skip this filter — inbreeders naturally violate HWE.
A. thaliana is a selfer — HWE filtering is wrong here! Arabidopsis thaliana is predominantly self-fertilising, so virtually all accessions are highly homozygous. HWE assumes random mating — applying it to a selfing species will remove real, biologically valid variants. Remove the --hwe flag for this dataset and any other predominantly selfing/asexual organism.

Step 4 — Nucleotide Diversity & Tajima’s D

Bash — Per-window diversity statistics
mkdir -p 3_diversity
VCF="2_filtered/filtered.vcf.gz"

# Nucleotide diversity (pi) in 10 kb windows
vcftools \
    --gzvcf ${VCF} \
    --window-pi 10000 \
    --window-pi-step 5000 \
    --out 3_diversity/pi

# Tajima's D in 10 kb windows
vcftools \
    --gzvcf ${VCF} \
    --TajimaD 10000 \
    --out 3_diversity/tajimaD

# Watterson's theta (per site)
vcftools \
    --gzvcf ${VCF} \
    --window-pi 10000 \
    --out 3_diversity/theta

# Quick summary statistics
echo "=== Tajima's D summary ==="
awk 'NR>1 {sum+=$4; n++; if($4>max || NR==2) max=$4; if($41 && $5!="nan" {sum+=$5; n++} END {print "Mean pi:", sum/n}' 3_diversity/pi.windowed.pi
Download:
Diversity statistics decoded
🌹--window-pi 10000 --window-pi-step 5000Computes nucleotide diversity (π) in sliding 10 kb windows stepping 5 kb at a time (50% overlap). Outputs .windowed.pi with columns: CHROM, BIN_START, BIN_END, N_VARIANTS, PI. Higher π = more variation = likely neutral or balancing selection region.
📊--TajimaD 10000Computes Tajima’s D in non-overlapping 10 kb windows. Positive D = excess of intermediate-frequency alleles (balancing selection or population bottleneck). Negative D = excess of rare alleles (population expansion or purifying/positive selection sweeping rare alleles to fixation).
🧬
Interpreting Tajima’s D in A. thaliana: The genome-wide mean D for Arabidopsis is around −1 due to its predominantly selfing mating system and recent range expansion out of Central Asia. Regions with strongly positive D (e.g. disease resistance genes) stand out as candidates for balancing selection.

Step 5 — Fst Between Populations

Compute pairwise Fst between geographic groups. You need a population map file (two-column: sample_name, population):

Bash — Windowed Fst between geographic groups
VCF="2_filtered/filtered.vcf.gz"

# Create population lists from the 1001 Genomes metadata
# Download metadata: https://1001genomes.org/accessions.html
# Columns: accession_id, country, group (Admixed/Central Europe/Germany/...)

# Example: extract Scandinavian vs. Iberian accession IDs
# (adjust based on your downloaded metadata file)
awk -F'\t' '$3=="Scandinavia" {print $1}' metadata_1001g.txt > pop_Scandinavia.txt
awk -F'\t' '$3=="Iberia"      {print $1}' metadata_1001g.txt > pop_Iberia.txt

echo "Scandinavia N=$(wc -l < pop_Scandinavia.txt)"
echo "Iberia      N=$(wc -l < pop_Iberia.txt)"

# Calculate windowed Fst (Weir & Cockerham 1984)
vcftools \
    --gzvcf ${VCF} \
    --weir-fst-pop pop_Scandinavia.txt \
    --weir-fst-pop pop_Iberia.txt \
    --fst-window-size 50000 \
    --fst-window-step 25000 \
    --out 3_diversity/fst_Scand_vs_Iberia

# Check mean weighted Fst (reported at the end of the log)
tail -5 3_diversity/fst_Scand_vs_Iberia.log

# Identify top Fst windows (candidate regions for local adaptation)
sort -k5 -rn 3_diversity/fst_Scand_vs_Iberia.windowed.weir.fst | head -20
Download:
Fst calculation decoded
🌎--weir-fst-popSpecifies a population file for Fst calculation. Provide two --weir-fst-pop flags for pairwise Fst, or three+ for multi-population comparison. VCFtools uses the Weir & Cockerham (1984) estimator, which corrects for unequal sample sizes between populations.
📊--fst-window-size 50000 --fst-window-step 2500050 kb sliding windows stepping 25 kb. Larger windows give smoother, more stable estimates but may mask narrow sweeps. For fine-mapping locally adapted regions, use 10–20 kb windows. The output column WEIGHTED_FST is the variance-weighted Fst — use this for ranking windows, not MEAN_FST.

Step 6 — LD Pruning & PLINK PCA

Bash — PLINK 2.0 LD pruning and PCA
mkdir -p 4_pca
module load plink2/2.0

VCF="2_filtered/filtered.vcf.gz"

# Step 1: Convert VCF to PLINK binary format
plink2 \
    --vcf ${VCF} \
    --make-bed \
    --out 4_pca/plink_all \
    --allow-extra-chr \
    --set-all-var-ids '@_#_$r_$a' \
    --new-id-max-allele-len 10 truncate

# Step 2: LD pruning — remove correlated SNPs
# Window: 50 SNPs, step: 10 SNPs, r^2 threshold: 0.1
plink2 \
    --bfile 4_pca/plink_all \
    --indep-pairwise 50 10 0.1 \
    --allow-extra-chr \
    --out 4_pca/ld_prune

echo "SNPs before pruning: $(wc -l < 4_pca/plink_all.bim)"
echo "SNPs after pruning:  $(wc -l < 4_pca/ld_prune.prune.in)"

# Step 3: Extract pruned SNP set
plink2 \
    --bfile 4_pca/plink_all \
    --extract 4_pca/ld_prune.prune.in \
    --make-bed \
    --allow-extra-chr \
    --out 4_pca/plink_pruned

# Step 4: Run PCA (first 20 PCs)
plink2 \
    --bfile 4_pca/plink_pruned \
    --pca 20 \
    --allow-extra-chr \
    --out 4_pca/pca

echo "PCA complete. Files: 4_pca/pca.eigenvec and 4_pca/pca.eigenval"
Download:
PLINK 2.0 PCA flags decoded
🔨--indep-pairwise 50 10 0.1LD pruning: scan SNPs in windows of 50, shift 10 SNPs at a time, remove one SNP from each pair with r² > 0.1. Produces .prune.in (keep) and .prune.out (remove) lists. Strict pruning (r² < 0.1) is appropriate for PCA and ADMIXTURE to ensure each SNP is independent.
🌎--allow-extra-chrRequired when your VCF uses non-numeric chromosome names (e.g. Chr1, Chr2 for A. thaliana) or chloroplast/mitochondria chromosomes. Without this flag, PLINK refuses to process non-standard chromosome codes.
📊--pca 20Compute the first 20 principal components. For most population datasets, PCs 1–5 capture the major population structure; higher PCs may reflect finer-scale structure or technical artefacts. Output: .eigenvec (PC scores per individual) and .eigenval (variance explained per PC).

Step 7 — Visualise PCA in R

R — PCA plot coloured by geographic group
library(ggplot2)
library(tidyverse)

# Load PCA results
pca <- read.table("4_pca/pca.eigenvec", header=FALSE)
eigenval <- read.table("4_pca/pca.eigenval", header=FALSE)

# Name columns
colnames(pca) <- c("FID", "IID", paste0("PC", 1:20))

# Calculate % variance explained
pct_var <- round(eigenval$V1 / sum(eigenval$V1) * 100, 1)
cat("PC1:", pct_var[1], "%\nPC2:", pct_var[2], "%\n")

# Load metadata (geographic group)
meta <- read.table("metadata_1001g.txt", header=TRUE, sep="\t") %>%
    rename(IID = accession_id)

# Merge
pca_meta <- left_join(pca, meta, by="IID")

# Plot PC1 vs PC2
ggplot(pca_meta, aes(x=PC1, y=PC2, color=group)) +
    geom_point(alpha=0.7, size=1.5) +
    labs(
        x = paste0("PC1 (", pct_var[1], "%)"),
        y = paste0("PC2 (", pct_var[2], "%)"),
        title = "A. thaliana 1001 Genomes — PCA",
        color = "Geographic group"
    ) +
    theme_bw(base_size=13) +
    theme(legend.position="right") +
    scale_color_brewer(palette="Set1")

ggsave("5_plots/pca_plot.png", width=9, height=7, dpi=150)

# Plot Tajima's D across the genome
tajima <- read.table("3_diversity/tajimaD.Tajima.D", header=TRUE)
tajima <- tajima[tajima$TajimaD != "NaN", ]
tajima$TajimaD <- as.numeric(tajima$TajimaD)

ggplot(tajima, aes(x=BIN_START/1e6, y=TajimaD, color=TajimaD)) +
    geom_point(size=0.4, alpha=0.6) +
    facet_wrap(~CHROM, scales="free_x") +
    scale_color_gradient2(low="blue", mid="grey70", high="red", midpoint=0) +
    geom_hline(yintercept=0, linetype="dashed", color="black") +
    labs(x="Position (Mb)", y="Tajima's D",
         title="Tajima's D across A. thaliana genome (10 kb windows)") +
    theme_bw(base_size=11) +
    theme(legend.position="none")

ggsave("5_plots/tajimad_plot.png", width=14, height=8, dpi=150)
Download:
R visualisation decoded
📊eigenval / sum * 100Converts raw eigenvalues to percentage variance explained per PC. Always label your PCA axes with variance explained — reviewers will ask for this. PC1 explaining >10% is typical for structured populations.
🌈scale_color_gradient2(midpoint=0)Diverging colour scale for Tajima’s D: blue = negative (expansion/positive selection), grey = neutral, red = positive (balancing selection/bottleneck). The genome-wide mean is plotted as a dashed line for reference.
🌠
Expected PCA result: For the 1001 Genomes data you should see PC1 separating African relict accessions from Eurasian accessions, and PC2 separating Scandinavian from Iberian/Western European accessions. This matches the known colonisation history of A. thaliana out of Central Asia into Europe after the last glacial maximum.
All Tutorials ADMIXTURE Population Structure