Fst & Selective Sweep Detection

Windowed Fst between populations · Tajima’s D & π neutrality tests · SweeD composite likelihood sweeps · XP-EHH & iHS haplotype statistics with rehh

~120 min Advanced Bash / R
Byte

Overview

Selective sweeps leave characteristic genomic footprints: reduced diversity, skewed allele frequencies, and long identical haplotypes surrounding the selected locus. No single statistic catches all sweeps — a hard sweep (one allele fixed rapidly) looks different from a soft sweep (multiple haplotypes selected simultaneously) or a partial sweep (still in progress). This tutorial teaches you a multi-layered approach combining three complementary methods:

MethodDetectsTime scaleNeeds phased data?
Windowed FstDifferentiation between populationsAnyNo
Tajima’s D / πAllele frequency skews, reduced diversityAncient–recentNo
SweeD (CLR)Hard sweeps — site frequency spectrum distortionRecent–ancientNo
XP-EHH / iHSIncomplete sweeps — extended haplotype homozygosityVery recentYes
Byte
Dataset: Maize domestication from teosinte (Zea mays ssp. parviglumis) is one of the best-characterised examples of artificial selection in plants. Hufford et al. 2012 sequenced 75 maize landrace + 26 teosinte accessions at ~30× coverage and identified hundreds of domestication sweep regions including the iconic tb1 (teosinte branched 1) and su1 (sugary 1) loci.

Ideal Directory Structure

popgen_sweeps_maize/
popgen_sweeps_maize/ ├── 0_raw/ # raw VCF from PRJNA178613 ├── 1_filtered/ # VCFtools-filtered biallelic SNPs ├── 2_popfiles/ # text files listing sample IDs per population │ ├── maize.txt │ └── teosinte.txt ├── 3_fst/ # windowed Fst output ├── 4_diversity/ # pi and Tajima's D windows ├── 5_sweed/ # SweeD CLR output per chromosome ├── 6_rehh/ # rehh haplotype input + XP-EHH/iHS output └── 7_plots/ # all R figures ├── manhattan_fst.png ├── sweed_clr.png └── xpehh_ihs.png

Dataset

The maize domestication dataset from Hufford et al. 2012 (Nature Genetics) is the landmark study of maize population genomics. It compared domesticated maize landraces vs. their wild ancestor teosinte, enabling direct measurement of the genomic footprint of domestication selection.

PropertyValue
OrganismZea mays ssp. mays (maize) vs. ssp. parviglumis (teosinte)
AccessionPRJNA178613 (SRA)
Samples75 maize landraces + 26 teosinte accessions
ReferenceB73 RefGen_v4 (GCF_902167145.1)
Notable sweep locitb1 (chr1), su1 (chr4), bt2 (chr4)
PublicationHufford et al. 2012, Nature Genetics

Step 1 — Download & Prepare VCF

Bash — Download maize/teosinte VCF and filter
######################################################################
## STEP 1: Download and filter the maize/teosinte SNP dataset
## Goal: get a clean biallelic SNP-only VCF ready for analysis
######################################################################

## --- Create output directories ---
mkdir -p 0_raw 1_filtered 2_popfiles 3_fst 4_diversity 5_sweed 6_rehh 7_plots

## --- Load required software modules ---
## (Adjust module names to match your HPC environment)
module load sra-toolkit   # for downloading from NCBI SRA
module load bcftools/1.17 # for VCF manipulation
module load vcftools/0.1.16 # for filtering and population statistics

## --- Download the raw VCF from NCBI (pre-called SNPs from the original paper) ---
## This downloads the final SNP VCF directly (not the raw FASTQ reads)
## File size is approximately 2 GB compressed
wget -P 0_raw/ \
  https://de.cyverse.org/dl/d/0A080C3D-C932-48A4-B73E-D49D4BC2EE93/ZeaGBSv2.7_20171204_AGPv4_maf005_max0.1miss.vcf.gz

## If the above link is unavailable, download raw reads from SRA and call variants:
## prefetch --option-file sra_accessions.txt  (see GATK tutorial for variant calling)

## --- Index the VCF so bcftools can query by region ---
bcftools index --tbi 0_raw/ZeaGBSv2.7_20171204_AGPv4_maf005_max0.1miss.vcf.gz

## --- Filter: keep only biallelic SNPs with MAF>5% and <10% missing ---
## --remove-indels   : remove insertion/deletion variants (SNPs only)
## --maf 0.05        : remove sites where minor allele frequency < 5%
## --max-missing 0.9 : remove sites where >10% of samples lack a genotype
## --recode          : write output in VCF format
## --out             : prefix for output files
vcftools \
    --gzvcf 0_raw/ZeaGBSv2.7_20171204_AGPv4_maf005_max0.1miss.vcf.gz \
    --remove-indels \
    --maf 0.05 \
    --max-missing 0.9 \
    --recode \
    --out 1_filtered/maize_filtered

## --- Compress and index the filtered VCF ---
## bgzip: block-compressed gzip (required for tabix random access)
## tabix: creates .tbi index for fast coordinate-based queries
bgzip 1_filtered/maize_filtered.recode.vcf
tabix -p vcf 1_filtered/maize_filtered.recode.vcf.gz
mv 1_filtered/maize_filtered.recode.vcf.gz 1_filtered/maize_filtered.vcf.gz
mv 1_filtered/maize_filtered.recode.vcf.gz.tbi 1_filtered/maize_filtered.vcf.gz.tbi

## --- Create population ID files ---
## These are simple text files — one sample name per line per population
## Sample names must exactly match the VCF header sample IDs
## Get the full sample list from the VCF header first:
bcftools query -l 1_filtered/maize_filtered.vcf.gz > 2_popfiles/all_samples.txt

## Then manually separate into maize and teosinte based on metadata
## from the original publication (Supplementary Table 1, Hufford 2012)
## Example: first 75 samples = maize, last 26 = teosinte
head -75 2_popfiles/all_samples.txt > 2_popfiles/maize.txt
tail -26 2_popfiles/all_samples.txt > 2_popfiles/teosinte.txt

echo "===== Dataset ready ====="
echo "Maize samples:    $(wc -l < 2_popfiles/maize.txt)"
echo "Teosinte samples: $(wc -l < 2_popfiles/teosinte.txt)"
echo "Total SNPs:       $(bcftools view --no-header 1_filtered/maize_filtered.vcf.gz | wc -l)"
Download:
Data preparation decoded
🔨bgzip vs gzipbgzip creates block-compressed files that tabix can index for random-access by chromosome:position. Regular gzip cannot be indexed this way. Always use bgzip for VCFs you plan to query with bcftools or tabix.
📄Population ID filesSimple one-column text files with sample IDs matching the VCF header exactly (case-sensitive). VCFtools uses these files to split samples into groups for per-population statistics. Getting sample assignment wrong silently produces incorrect Fst values — always double-check.

Step 2 — Windowed Fst

Bash — Weir & Cockerham windowed Fst (maize vs teosinte)
######################################################################
## STEP 2: Calculate windowed Fst between maize and teosinte
## Fst measures how different two populations are at each genomic region
## High Fst = one population has a very different allele frequency there
##   = possible evidence of selection favouring different alleles
######################################################################

VCF="1_filtered/maize_filtered.vcf.gz"   # filtered VCF from Step 1
POP1="2_popfiles/maize.txt"              # file listing maize sample IDs
POP2="2_popfiles/teosinte.txt"           # file listing teosinte sample IDs

## --- Calculate windowed Fst ---
## --weir-fst-pop : specify one population file per flag (need at least 2)
## --fst-window-size 50000  : each window spans 50,000 base pairs (50 kb)
## --fst-window-step 25000  : windows slide every 25,000 bp (50% overlap)
##   → overlapping windows give smoother genome-wide signal
## --out : prefix for output files (.windowed.weir.fst will be created)
vcftools \
    --gzvcf ${VCF} \
    --weir-fst-pop ${POP1} \
    --weir-fst-pop ${POP2} \
    --fst-window-size 50000 \
    --fst-window-step 25000 \
    --out 3_fst/maize_vs_teo

## --- View the output ---
## Columns: CHROM, BIN_START, BIN_END, N_VARIANTS (SNPs in window),
##          WEIGHTED_FST (use this!), MEAN_FST
head 3_fst/maize_vs_teo.windowed.weir.fst

## --- Extract top 1% highest-Fst windows (candidate sweep regions) ---
## Sort by column 5 (WEIGHTED_FST) in reverse order, skip header
## Calculate 99th percentile cutoff with awk first
CUTOFF=$(awk 'NR>1 {print $5}' 3_fst/maize_vs_teo.windowed.weir.fst | \
         sort -n | \
         awk 'BEGIN{n=0} {vals[n++]=$1} END{print vals[int(n*0.99)]}')

echo "Top 1% Fst cutoff: ${CUTOFF}"

## Extract windows above the cutoff
awk -v cut="${CUTOFF}" 'NR==1 || $5 >= cut' \
    3_fst/maize_vs_teo.windowed.weir.fst \
    > 3_fst/top1pct_fst_windows.txt

echo "High-Fst windows: $(wc -l < 3_fst/top1pct_fst_windows.txt)"
Download:
Windowed Fst parameters decoded
🎯WEIGHTED_FST vs MEAN_FSTWEIGHTED_FST weights each SNP’s contribution by its variance — this is the Weir & Cockerham estimator and is the correct value to use for ranking windows. MEAN_FST is the simple arithmetic mean and can be biased when sample sizes differ between populations.
📈Top 1% cutoffUsing an empirical percentile threshold (top 1% or 5%) is the standard approach for detecting outlier windows. Avoid using a fixed absolute cutoff (e.g. Fst > 0.5) — the appropriate threshold depends on your organism’s baseline differentiation level.
👥Window size choice50 kb windows are standard for maize (~2.4 Gb genome, high recombination). For organisms with lower recombination rates, use larger windows (100–200 kb). Smaller windows (10–20 kb) increase resolution but produce noisier per-window estimates.

Step 3 — Diversity Statistics: π and Tajima’s D

Bash — Per-population pi and Tajima's D in sliding windows
######################################################################
## STEP 3: Calculate nucleotide diversity (pi) and Tajima's D
## These statistics measure how "variable" the genome is within
## each population in each window:
##   pi      = average pairwise differences per site (more = more diverse)
##   Tajima's D = negative during sweeps (excess of rare alleles)
######################################################################

VCF="1_filtered/maize_filtered.vcf.gz"

## --- Calculate pi for maize only ---
## --keep : restrict analysis to samples in this file (maize only)
## --window-pi 50000 : sliding 50 kb windows
## --window-pi-step 25000 : 25 kb step (50% overlap with previous window)
vcftools \
    --gzvcf ${VCF} \
    --keep 2_popfiles/maize.txt \
    --window-pi 50000 \
    --window-pi-step 25000 \
    --out 4_diversity/pi_maize

## --- Calculate pi for teosinte only ---
vcftools \
    --gzvcf ${VCF} \
    --keep 2_popfiles/teosinte.txt \
    --window-pi 50000 \
    --window-pi-step 25000 \
    --out 4_diversity/pi_teosinte

## --- Calculate Tajima's D for maize ---
## --TajimaD 50000 : non-overlapping 50 kb windows
## Output file: .Tajima.D  (columns: CHROM, BIN_START, N_SNPS, TajimaD)
vcftools \
    --gzvcf ${VCF} \
    --keep 2_popfiles/maize.txt \
    --TajimaD 50000 \
    --out 4_diversity/tajD_maize

## --- Calculate Tajima's D for teosinte ---
vcftools \
    --gzvcf ${VCF} \
    --keep 2_popfiles/teosinte.txt \
    --TajimaD 50000 \
    --out 4_diversity/tajD_teosinte

## --- Quick genome-wide summaries ---
echo "=== Maize mean pi ==="
awk 'NR>1 && $5!="nan" {sum+=$5; n++} END {printf "%.6f\n", sum/n}' \
    4_diversity/pi_maize.windowed.pi

echo "=== Teosinte mean pi ==="
awk 'NR>1 && $5!="nan" {sum+=$5; n++} END {printf "%.6f\n", sum/n}' \
    4_diversity/pi_teosinte.windowed.pi

echo "=== Maize mean Tajima's D ==="
awk 'NR>1 && $4!="NaN" {sum+=$4; n++} END {printf "%.4f\n", sum/n}' \
    4_diversity/tajD_maize.Tajima.D
Download:
Diversity statistics decoded
🌹pi (nucleotide diversity)Average number of pairwise differences per base pair between two randomly sampled sequences. Expect maize π < teosinte π genome-wide because domestication caused a population bottleneck, reducing diversity. At sweep loci, maize π drops even lower than the genome-wide bottleneck reduction.
📊Tajima’s D during a sweepUnder a selective sweep, rare alleles accumulate around the selected locus (hitchhiked variants are young and thus rare). This pushes Tajima’s D strongly negative. Maize domestication sweeps should show D < −2 in affected regions, while teosinte in the same regions shows D near zero.
🌿
Expected result for maize: Genome-wide, maize π ≈ 0.007 and teosinte π ≈ 0.011 — maize has ~36% less diversity than teosinte due to the domestication bottleneck. At the tb1 locus on chr1, maize π drops to nearly zero, confirming a hard sweep at this key domestication gene.

Step 4 — SweeD Composite Likelihood Ratio Sweeps

SweeD (Pavlidis et al. 2013) scans the site frequency spectrum (SFS) for the signature of a selective sweep — an excess of low-frequency variants flanking a high-frequency derived allele.

Bash — Convert VCF to SweeD format and run genome-wide scan
######################################################################
## STEP 4: Run SweeD to detect selective sweeps via the CLR (Composite
## Likelihood Ratio) statistic.
## SweeD compares the observed site frequency spectrum (SFS) at each
## position to what we would expect under a neutral model.
## High CLR = SFS looks like a sweep happened here.
######################################################################

module load vcftools/0.1.16
## SweeD is typically compiled from source: https://github.com/alachins/sweed

## --- Extract maize samples only into a separate VCF ---
## SweeD analyses one population at a time
vcftools \
    --gzvcf 1_filtered/maize_filtered.vcf.gz \
    --keep 2_popfiles/maize.txt \   # only maize samples
    --recode \
    --out 5_sweed/maize_only

bgzip 5_sweed/maize_only.recode.vcf
tabix -p vcf 5_sweed/maize_only.recode.vcf.gz
mv 5_sweed/maize_only.recode.vcf.gz 5_sweed/maize_only.vcf.gz
mv 5_sweed/maize_only.recode.vcf.gz.tbi 5_sweed/maize_only.vcf.gz.tbi

## --- Run SweeD on each chromosome ---
## -name     : output file prefix
## -input    : input VCF
## -grid 2000 : evaluate CLR at 2000 equally spaced grid points per chromosome
##   (more grid points = finer resolution but slower)
## -maf 0.05  : ignore sites with minor allele frequency below 5%
## -folded    : use folded SFS (when ancestral allele is unknown)
##   (unfolded is more powerful if you know the outgroup/ancestral state)
## -pop 1     : analyse population 1 (maize is the only population here)
for CHR in 1 2 3 4 5 6 7 8 9 10; do
    echo "Running SweeD on chromosome ${CHR}..."
    SweeD \
        -name chr${CHR}_maize \
        -input 5_sweed/maize_only.vcf.gz \
        -region chr${CHR} \
        -grid 2000 \
        -maf 0.05 \
        -folded \
        -pop 1 \
        -output 5_sweed/
    echo "  Done: 5_sweed/SweeD_Report.chr${CHR}_maize"
done

## --- Concatenate all chromosomes into one file ---
## Header is only in the first file; tail -n +2 skips it for subsequent files
head -1 5_sweed/SweeD_Report.chr1_maize > 5_sweed/sweed_all_chr.txt
for CHR in 1 2 3 4 5 6 7 8 9 10; do
    tail -n +2 5_sweed/SweeD_Report.chr${CHR}_maize >> 5_sweed/sweed_all_chr.txt
done

## --- Find top 0.1% CLR positions (candidate sweep sites) ---
CUTOFF=$(awk 'NR>1 {print $3}' 5_sweed/sweed_all_chr.txt | \
         sort -n | \
         awk 'BEGIN{n=0} {vals[n++]=$1} END{print vals[int(n*0.999)]}')
echo "Top 0.1% CLR cutoff: ${CUTOFF}"

awk -v cut="${CUTOFF}" 'NR==1 || $3 >= cut' \
    5_sweed/sweed_all_chr.txt > 5_sweed/top_clr_sites.txt
wc -l 5_sweed/top_clr_sites.txt
Download:
SweeD parameters decoded
📈-grid 2000Number of candidate sweep positions to evaluate per chromosome. SweeD places 2,000 equally spaced points and calculates the CLR at each. For a maize chromosome (~300 Mb), this gives ~150 kb resolution. Increase to 5,000–10,000 for fine-mapping after an initial scan identifies candidates.
🌈-folded vs unfolded SFSThe SFS can be folded (ignoring which allele is ancestral) or unfolded (polarised using an outgroup). Unfolded SFS is more powerful but requires an outgroup to assign ancestral states. For maize, use teosinte as outgroup for an unfolded analysis. Use -folded when no reliable outgroup is available.
🎯CLR thresholdThe CLR statistic has no fixed threshold — always use an empirical percentile (top 1% or 0.1%). The absolute CLR value depends on sample size and SNP density. A CLR that is significant in one dataset cannot be compared to another dataset with different samples.

Step 5 — XP-EHH & iHS with rehh

Extended haplotype homozygosity (EHH) statistics detect incomplete sweeps where a selected haplotype is still increasing in frequency. They require phased genotype data.

Bash — Phase VCF with BEAGLE (required for rehh)
######################################################################
## STEP 5a: Phase genotypes with BEAGLE 5.4
## XP-EHH and iHS require phased haplotypes — each individual's two
## chromosomes must be assigned (e.g. "A|G" not "A/G")
## BEAGLE 5.4 performs statistical phasing using haplotype patterns
######################################################################

module load java/17    # BEAGLE runs as a Java .jar file
## Download BEAGLE 5.4: https://faculty.washington.edu/browning/beagle/beagle.html

BEAGLE="beagle.22Jul22.46e.jar"   # path to BEAGLE .jar file
VCF="1_filtered/maize_filtered.vcf.gz"

## --- Phase each chromosome separately (faster and uses less memory) ---
for CHR in 1 2 3 4 5 6 7 8 9 10; do
    echo "Phasing chromosome ${CHR}..."

    ## gt       : input VCF with unphased genotypes (GT field must be present)
    ## out      : output file prefix (creates .vcf.gz automatically)
    ## chrom    : only phase this chromosome (speeds up phasing)
    ## nthreads : number of CPU threads to use
    ## ne       : effective population size estimate for the phasing model
    ##   (for maize: ~1,000,000 based on population history)
    java -Xmx16g -jar ${BEAGLE} \
        gt=${VCF} \
        out=6_rehh/phased_chr${CHR} \
        chrom=chr${CHR} \
        nthreads=8 \
        ne=1000000

    echo "  Done: 6_rehh/phased_chr${CHR}.vcf.gz"
done
Download:
BEAGLE phasing parameters decoded
👥ne (effective population size)BEAGLE uses a Li-Stephens haplotype model that needs an estimate of Ne. For humans: ~10,000–20,000. For maize: ~500,000–1,000,000 (larger due to recent population growth after domestication). Getting Ne very wrong affects phasing accuracy but BEAGLE is robust to 2–5x errors.
🏠-Xmx16gJava memory allocation: 16 GB maximum heap. Phasing large chromosomes (>50k SNPs) can require 8–32 GB RAM. If BEAGLE crashes with OutOfMemoryError, increase this value.
R — XP-EHH and iHS with rehh package
######################################################################
## STEP 5b: Calculate XP-EHH and iHS using the rehh R package
## XP-EHH = Cross-Population Extended Haplotype Homozygosity
##   Compares haplotype homozygosity BETWEEN two populations
##   High XP-EHH in maize vs teosinte = maize has longer shared haplotypes
##   = evidence that a maize haplotype was under positive selection
## iHS = integrated Haplotype Score (within-population)
##   Compares derived vs ancestral haplotype lengths WITHIN one population
##   High |iHS| = one haplotype is much longer than expected = recent sweep
######################################################################

library(rehh)       # haplotype-based selection scan package
library(tidyverse)  # data manipulation and plotting
library(ggplot2)

## ---- Process one chromosome at a time (here: chromosome 1) --------
CHR <- "chr1"

## --- Step 1: Read phased VCF into rehh haplotype format ---
## data2haplohh() converts a phased VCF into rehh's internal format
## allele_coding = "01": encode alleles as 0 (ref) and 1 (alt)
## min_maf = 0.05: skip sites with minor allele freq < 5% (too rare to phase well)
## max_missing = 0.1: allow up to 10% missing genotypes
hap_maize <- data2haplohh(
    hap_file  = paste0("6_rehh/phased_", CHR, ".vcf.gz"),
    vcf_reader = "data.table",           # fast reader for large files
    allele_coding = "01",
    min_maf = 0.05,
    max_missing = 0.1,
    haplotype_columns = 1:150,           # columns 1-150 = maize samples (2 haplotypes each)
    verbose = TRUE
)

hap_teosinte <- data2haplohh(
    hap_file  = paste0("6_rehh/phased_", CHR, ".vcf.gz"),
    vcf_reader = "data.table",
    allele_coding = "01",
    min_maf = 0.05,
    max_missing = 0.1,
    haplotype_columns = 151:202,         # columns 151-202 = teosinte samples
    verbose = TRUE
)

## --- Step 2: Calculate iHS (within-population, maize only) ---
## scan_hh(): computes EHH and iHH (integrated EHH) at each SNP
scan_maize <- scan_hh(
    hap_maize,
    polarized = FALSE   # unpolarized: no ancestral allele assignment needed
)

## ihh2ihs(): converts iHH values to the standardised iHS score
## Standardisation removes the correlation between iHS and allele frequency
ihs_result <- ihh2ihs(
    scan_maize,
    freqbin = 0.025     # bin SNPs by frequency in 2.5% bins for standardisation
)

## --- Step 3: Calculate XP-EHH (cross-population, maize vs teosinte) ---
## scan_hh() for each population separately, then ies2xpehh()
scan_teosinte <- scan_hh(hap_teosinte, polarized = FALSE)

xpehh_result <- ies2xpehh(
    scan_pop1 = scan_maize,       # "test" population (domesticated)
    scan_pop2 = scan_teosinte,    # "reference" population (wild)
    popname1 = "maize",
    popname2 = "teosinte",
    include_freq = TRUE           # include allele frequency in output
)

## --- Step 4: Save results ---
write.csv(ihs_result$ihs,   "6_rehh/ihs_chr1_maize.csv",   row.names=FALSE)
write.csv(xpehh_result,     "6_rehh/xpehh_chr1.csv",       row.names=FALSE)

cat("iHS SNPs with |iHS| > 2:  ", sum(abs(ihs_result$ihs$IHS) > 2, na.rm=TRUE), "\n")
cat("XP-EHH SNPs > 2:          ", sum(xpehh_result$XPEHH > 2, na.rm=TRUE), "\n")
Download:
rehh XP-EHH and iHS decoded
🌟iHS (integrated Haplotype Score)Compares the area under the EHH decay curve for the derived vs ancestral allele at each SNP. If the derived allele is under positive selection, it rises to high frequency before EHH has decayed — giving a longer haplotype. |iHS| > 2 (2 standard deviations) is typically used as a significance threshold.
🌎XP-EHH (Cross-Population EHH)Compares how fast haplotype homozygosity decays in population 1 vs population 2 moving away from each SNP. If population 1 (maize) has a much longer haplotype than population 2 (teosinte) at a locus, it suggests selection drove that haplotype to high frequency specifically in maize after the split. Positive XP-EHH = evidence of sweep in maize.
📊freqbin = 0.025iHS is frequency-dependent — rarer alleles naturally have longer haplotypes. Standardisation by frequency bins removes this confounding effect. SNPs are grouped into 2.5% frequency bins and iHS is Z-score normalised within each bin.
🌿
tb1 is the golden standard: The tb1 locus on chr1 (~265 Mb in B73v4 coordinates) should show strongly positive XP-EHH in maize vs teosinte, strongly negative Tajima’s D in maize, high Fst, and high SweeD CLR. If your pipeline produces a peak at tb1, your analysis is working correctly.

Step 6 — Combine & Visualise in R

R — Multi-panel Manhattan plot of all sweep statistics
######################################################################
## STEP 6: Create a 4-panel genome-wide Manhattan plot showing:
##   Panel 1: Windowed Fst (maize vs teosinte)
##   Panel 2: Pi ratio (teosinte pi / maize pi) — enriched at sweeps
##   Panel 3: SweeD CLR statistic
##   Panel 4: XP-EHH statistic
## All panels share the same x-axis (genomic position)
######################################################################

library(tidyverse)
library(ggplot2)
library(patchwork)      # combines multiple ggplots into one figure

## ---- Load data -------------------------------------------------------

## Fst results
fst <- read.table("3_fst/maize_vs_teo.windowed.weir.fst", header=TRUE) %>%
    rename(chrom=CHROM, pos=BIN_START, fst=WEIGHTED_FST) %>%
    filter(fst >= 0)    # remove negative Fst estimates (set to 0 in practice)

## Pi results — calculate pi ratio (teosinte/maize) as a sweep indicator
pi_maize <- read.table("4_diversity/pi_maize.windowed.pi", header=TRUE) %>%
    select(CHROM, BIN_START, PI) %>% rename(pi_maize=PI)
pi_teo   <- read.table("4_diversity/pi_teosinte.windowed.pi", header=TRUE) %>%
    select(CHROM, BIN_START, PI) %>% rename(pi_teo=PI)

pi_ratio <- left_join(pi_maize, pi_teo, by=c("CHROM","BIN_START")) %>%
    mutate(pi_ratio = pi_teo / (pi_maize + 1e-8)) %>%   # add tiny value to avoid divide-by-zero
    rename(chrom=CHROM, pos=BIN_START)

## SweeD CLR
sweed <- read.table("5_sweed/sweed_all_chr.txt", header=TRUE) %>%
    rename(chrom=Position, pos=Chromosome, clr=Likelihood)

## XP-EHH (chr1 only here — repeat for all chromosomes in practice)
xpehh <- read.csv("6_rehh/xpehh_chr1.csv") %>%
    rename(pos=POSITION, xpehh=XPEHH) %>%
    mutate(chrom="chr1")

## ---- Helper function: standard Manhattan plot theme -----------------
manhattan_theme <- function() {
    theme_bw(base_size=11) +
    theme(
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle=45, hjust=1, size=8)
    )
}

## ---- Panel 1: Fst ---------------------------------------------------
p1 <- ggplot(fst, aes(x=pos/1e6, y=fst)) +
    geom_point(size=0.3, alpha=0.5, color="steelblue") +
    geom_hline(yintercept=quantile(fst$fst, 0.99), color="red", linetype="dashed") +
    facet_grid(~chrom, scales="free_x", space="free_x") +
    labs(y="Fst", x=NULL, title="Maize vs Teosinte — Sweep Statistics") +
    manhattan_theme()

## ---- Panel 2: Pi ratio ----------------------------------------------
p2 <- ggplot(pi_ratio, aes(x=pos/1e6, y=pi_ratio)) +
    geom_point(size=0.3, alpha=0.5, color="forestgreen") +
    geom_hline(yintercept=quantile(pi_ratio$pi_ratio, 0.99, na.rm=TRUE),
               color="red", linetype="dashed") +
    facet_grid(~chrom, scales="free_x", space="free_x") +
    labs(y="Pi ratio\n(teo/maize)", x=NULL) +
    manhattan_theme()

## ---- Panel 3: SweeD CLR ---------------------------------------------
p3 <- ggplot(sweed, aes(x=pos/1e6, y=clr)) +
    geom_point(size=0.3, alpha=0.5, color="darkorange") +
    geom_hline(yintercept=quantile(sweed$clr, 0.999, na.rm=TRUE),
               color="red", linetype="dashed") +
    facet_grid(~chrom, scales="free_x", space="free_x") +
    labs(y="SweeD CLR", x=NULL) +
    manhattan_theme()

## ---- Panel 4: XP-EHH (chr1) ----------------------------------------
p4 <- ggplot(xpehh, aes(x=pos/1e6, y=xpehh)) +
    geom_point(size=0.4, alpha=0.6, color="purple") +
    geom_hline(yintercept=2, color="red", linetype="dashed") +
    labs(y="XP-EHH\n(maize vs teo)", x="Position (Mb) — Chr1") +
    theme_bw(base_size=11)

## ---- Combine with patchwork -----------------------------------------
## p1 / p2 / p3 = stack panels vertically
final_plot <- p1 / p2 / p3 / p4 +
    plot_layout(heights=c(2,2,2,1.5))   # relative panel heights

ggsave("7_plots/manhattan_sweep_stats.png", final_plot,
       width=20, height=14, dpi=150)
Download:
Visualisation decoded
📈Pi ratio (teo/maize)Dividing teosinte π by maize π amplifies the signal at sweep loci. At a sweep, maize π drops to near zero while teosinte π stays normal, making the ratio very large. This is a simple but powerful complementary statistic to Fst.
🎯patchwork panel layoutThe patchwork package stacks ggplots vertically with / operators. plot_layout(heights=) controls relative panel heights. Aligning all panels to the same chromosomal x-axis allows easy visual identification of genomic regions where multiple statistics co-localise — these are the most convincing sweep candidates.
🔬
The convergence rule: A genuine selective sweep will show high signal in multiple statistics simultaneously — high Fst, high pi ratio, high CLR, and high XP-EHH all pointing to the same genomic location. A false positive typically appears in only one or two statistics. Always require at least 2–3 statistics to converge before calling a locus as a sweep candidate.