Before any population genetic analysis, your VCF must be rigorously filtered. Raw variant calls from GATK or other callers contain genuine polymorphisms, sequencing errors, mapping artefacts, and low-confidence genotypes that will bias every downstream statistic. This tutorial covers the standard filtering and QC pipeline, followed by computing the core summary statistics (nucleotide diversity π, Tajima’s D, pairwise Fst) and a PCA to visualise population structure.
Key concepts:
MAF (Minor Allele Frequency): minimum frequency threshold below which sites are removed — singletons are often sequencing errors
Missingness: fraction of individuals lacking a genotype at a site — high missingness causes biased statistics
LD pruning: removes correlated SNPs before PCA/ADMIXTURE so that independent loci are analysed
Fst: fixation index — proportion of total genetic variance explained by between-population differences (0 = identical, 1 = completely fixed)
Tajima’s D: compares two estimators of θ (Watterson’s vs. pairwise) — negative = excess rare alleles (expansion/selection), positive = excess intermediate alleles (balancing selection/bottleneck)
Ideal Directory Structure
Keep raw data untouched and separate from filtered outputs at every step.
The Arabidopsis thaliana 1001 Genomes Project sequenced 1,135 natural inbred lines from across the native Eurasian range at ~15x coverage. It is the most comprehensive plant population genomics dataset available and has been used to study local adaptation, selfing, and the genetic architecture of complex traits.
Why this dataset? It is biallelic SNP-only, already hard-filtered by the consortium, making it an ideal starting point to learn downstream filtering. The accessions span six geographic groups (Africa, Central Europe, Iberia, Scandinavia, Western Europe, Relict) enabling clean multi-population Fst and PCA demonstrations.
Step 1 — Download & Inspect VCF
Bash — Download 1001 Genomes VCF
mkdir -p 0_raw
cd 0_raw
# Download the 1001 Genomes SNP VCF (all chromosomes, ~1.4 GB compressed)
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz.tbi
# Or download chromosome-by-chromosome for testing (Chr1 only, ~280 MB)
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/by_chromosome/GMI-MPI_1001-genomes_chr1.vcf.gz
# Quick inspection
module load bcftools/1.17
bcftools stats 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | head -50
# Count variants and samples
bcftools query -l 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | wc -l # samples
bcftools view --no-header 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | wc -l # variants
# View first few records
bcftools view -H 1001genomes_snp-short-indel_only_ACGTN.vcf.gz | head -5
cd ..
Download:
Download and inspection commands decoded
📊bcftools stats
Generates a detailed QC report including: number of SNPs, indels, ts/tv ratio, per-sample depth, and allele frequency distribution. Pipe through grep ^SN to get just the summary numbers.
👥bcftools query -l
Lists all sample names in the VCF header — one per line. Piping to wc -l gives the sample count. Use this to verify the VCF contains the samples you expect.
📄.tbi index file
Tabix index for random-access queries by genomic coordinate. Required for bcftools region queries (-r Chr1:1-1000000). Always download the .tbi alongside the VCF.gz.
Step 2 — bcftools QC & Stats
Bash — bcftools stats and multiallelic check
mkdir -p 1_qc
VCF="0_raw/1001genomes_snp-short-indel_only_ACGTN.vcf.gz"
# Full stats report
bcftools stats ${VCF} > 1_qc/bcftools_stats.txt
# Extract key summary lines
grep "^SN" 1_qc/bcftools_stats.txt
# Check for multiallelic sites (more than 2 alleles)
bcftools view --min-alleles 3 ${VCF} | bcftools stats | grep "^SN"
# Split multiallelic sites into biallelic (required by PLINK/ADMIXTURE)
bcftools norm \
--multiallelics -snps \
--output-type z \
--output 1_qc/biallelic.vcf.gz \
${VCF}
bcftools index --tbi 1_qc/biallelic.vcf.gz
# Count SNPs only (no indels) after splitting
bcftools view --type snps 1_qc/biallelic.vcf.gz | bcftools stats | grep "^SN"
echo "QC complete. Check 1_qc/bcftools_stats.txt"
Download:
bcftools normalisation decoded
🔨bcftools norm --multiallelics -snps
Splits multiallelic SNP records (e.g. A→C,T at one position) into separate biallelic records (A→C and A→T). PLINK and ADMIXTURE require biallelic-only VCFs. The -snps flag processes only SNPs, leaving indels untouched.
📊ts/tv ratio
Transition/transversion ratio. Expected ~2.0–2.1 for whole-genome SNPs. Values far from this (e.g. 1.5) indicate poor filtering or mapping artefacts. Check this in bcftools_stats.txt under TSTV.
Step 3 — VCFtools Filtering
Apply quality filters to remove low-confidence sites and samples:
Remove sites where the minor allele frequency is <5%. Singletons and very rare variants are disproportionately affected by genotyping errors. For population structure analyses (PCA, ADMIXTURE) you want common variants that are informative across many individuals.
❌--max-missing 0.9
Keep only sites where ≥90% of individuals have a genotype call. Sites with >10% missing data are unreliable for allele frequency estimates. Counterintuitively, 0.9 means “allow up to 10% missing” in VCFtools notation.
🎯--minQ 30
Minimum QUAL score of 30 (Phred-scaled probability that the site is not a variant). Q30 = 0.1% chance the variant call is wrong. Sites below Q30 are likely sequencing or mapping artefacts.
📈--min-meanDP 5 --max-meanDP 50
Filter by mean depth across samples. Sites with very low depth (<5x) have unreliable genotype calls. Sites with very high depth (>50x for most organisms) are often in repetitive/collapsed regions and produce false heterozygosity.
🧬--hwe 0.001
Remove sites that deviate significantly from Hardy-Weinberg equilibrium (p<0.001). In diploid outbreeders, severe HWE deviation indicates genotyping errors (e.g. paralog calling). For selfing species like A. thaliana, skip this filter — inbreeders naturally violate HWE.
⚠
A. thaliana is a selfer — HWE filtering is wrong here! Arabidopsis thaliana is predominantly self-fertilising, so virtually all accessions are highly homozygous. HWE assumes random mating — applying it to a selfing species will remove real, biologically valid variants. Remove the --hwe flag for this dataset and any other predominantly selfing/asexual organism.
Computes nucleotide diversity (π) in sliding 10 kb windows stepping 5 kb at a time (50% overlap). Outputs .windowed.pi with columns: CHROM, BIN_START, BIN_END, N_VARIANTS, PI. Higher π = more variation = likely neutral or balancing selection region.
📊--TajimaD 10000
Computes Tajima’s D in non-overlapping 10 kb windows. Positive D = excess of intermediate-frequency alleles (balancing selection or population bottleneck). Negative D = excess of rare alleles (population expansion or purifying/positive selection sweeping rare alleles to fixation).
🧬
Interpreting Tajima’s D in A. thaliana: The genome-wide mean D for Arabidopsis is around −1 due to its predominantly selfing mating system and recent range expansion out of Central Asia. Regions with strongly positive D (e.g. disease resistance genes) stand out as candidates for balancing selection.
Step 5 — Fst Between Populations
Compute pairwise Fst between geographic groups. You need a population map file (two-column: sample_name, population):
Bash — Windowed Fst between geographic groups
VCF="2_filtered/filtered.vcf.gz"
# Create population lists from the 1001 Genomes metadata
# Download metadata: https://1001genomes.org/accessions.html
# Columns: accession_id, country, group (Admixed/Central Europe/Germany/...)
# Example: extract Scandinavian vs. Iberian accession IDs
# (adjust based on your downloaded metadata file)
awk -F'\t' '$3=="Scandinavia" {print $1}' metadata_1001g.txt > pop_Scandinavia.txt
awk -F'\t' '$3=="Iberia" {print $1}' metadata_1001g.txt > pop_Iberia.txt
echo "Scandinavia N=$(wc -l < pop_Scandinavia.txt)"
echo "Iberia N=$(wc -l < pop_Iberia.txt)"
# Calculate windowed Fst (Weir & Cockerham 1984)
vcftools \
--gzvcf ${VCF} \
--weir-fst-pop pop_Scandinavia.txt \
--weir-fst-pop pop_Iberia.txt \
--fst-window-size 50000 \
--fst-window-step 25000 \
--out 3_diversity/fst_Scand_vs_Iberia
# Check mean weighted Fst (reported at the end of the log)
tail -5 3_diversity/fst_Scand_vs_Iberia.log
# Identify top Fst windows (candidate regions for local adaptation)
sort -k5 -rn 3_diversity/fst_Scand_vs_Iberia.windowed.weir.fst | head -20
Download:
Fst calculation decoded
🌎--weir-fst-pop
Specifies a population file for Fst calculation. Provide two --weir-fst-pop flags for pairwise Fst, or three+ for multi-population comparison. VCFtools uses the Weir & Cockerham (1984) estimator, which corrects for unequal sample sizes between populations.
📊--fst-window-size 50000 --fst-window-step 25000
50 kb sliding windows stepping 25 kb. Larger windows give smoother, more stable estimates but may mask narrow sweeps. For fine-mapping locally adapted regions, use 10–20 kb windows. The output column WEIGHTED_FST is the variance-weighted Fst — use this for ranking windows, not MEAN_FST.
LD pruning: scan SNPs in windows of 50, shift 10 SNPs at a time, remove one SNP from each pair with r² > 0.1. Produces .prune.in (keep) and .prune.out (remove) lists. Strict pruning (r² < 0.1) is appropriate for PCA and ADMIXTURE to ensure each SNP is independent.
🌎--allow-extra-chr
Required when your VCF uses non-numeric chromosome names (e.g. Chr1, Chr2 for A. thaliana) or chloroplast/mitochondria chromosomes. Without this flag, PLINK refuses to process non-standard chromosome codes.
📊--pca 20
Compute the first 20 principal components. For most population datasets, PCs 1–5 capture the major population structure; higher PCs may reflect finer-scale structure or technical artefacts. Output: .eigenvec (PC scores per individual) and .eigenval (variance explained per PC).
Converts raw eigenvalues to percentage variance explained per PC. Always label your PCA axes with variance explained — reviewers will ask for this. PC1 explaining >10% is typical for structured populations.
🌈scale_color_gradient2(midpoint=0)
Diverging colour scale for Tajima’s D: blue = negative (expansion/positive selection), grey = neutral, red = positive (balancing selection/bottleneck). The genome-wide mean is plotted as a dashed line for reference.
🌠
Expected PCA result: For the 1001 Genomes data you should see PC1 separating African relict accessions from Eurasian accessions, and PC2 separating Scandinavian from Iberian/Western European accessions. This matches the known colonisation history of A. thaliana out of Central Asia into Europe after the last glacial maximum.