Introgression: ABBA-BABA & D-statistics

Patterson’s D-statistic · f3 & f4 admixture fractions · fd-statistic for local introgression · ADMIXTOOLS2 graph fitting · Dsuite multi-population tests

~120 min Advanced Bash / R
Byte

Overview

When two populations that diverged long ago come back into secondary contact, they can exchange genetic material — a process called introgression. The ABBA-BABA test (Green et al. 2010) detects this by looking for an asymmetry in the pattern of shared derived alleles between four populations arranged in a tree. This tutorial covers the complete pipeline from raw allele counts to population graph fitting.

Byte
The four-taxon test — what are ABBA and BABA patterns?
For a tree ((P1, P2), P3, Outgroup):
  • ABBA: P1 has ancestral allele (A), P2 and P3 share derived allele (B) — suggests P2–P3 gene flow
  • BABA: P1 and P3 share derived allele (B), P2 has ancestral allele (A) — suggests P1–P3 gene flow
  • Under a simple bifurcating tree: ABBA count ≈ BABA count → D ≈ 0
  • If P3 introgressed into P2: more ABBA than BABA → D > 0, Z-score > 3

Ideal Directory Structure

introgression_human/
introgression_human/ ├── 0_raw/ # VCF from SGDP + Neanderthal ├── 1_filtered/ # filtered biallelic SNPs ├── 2_popmap/ # population assignment files │ ├── popmap.txt # sample → population mapping (for Dsuite) │ └── tree.nwk # guide tree (Newick format) ├── 3_dsuite/ # Dsuite Dtrios output ├── 4_admixtools/ # ADMIXTOOLS2 f3/f4/qpGraph ├── 5_fd/ # fd windowed introgression map └── 6_plots/ # final figures

Dataset

We use the Simons Genome Diversity Project (SGDP) modern human panel combined with the high-coverage Altai Neanderthal genome to replicate the canonical Neanderthal introgression analysis. This is the best-validated introgression dataset in all of genomics.

PropertyValue
Modern humansSGDP — 279 individuals from 142 populations (PRJEB9586)
NeanderthalAltai Neanderthal (high-coverage, PRJNA20345)
OutgroupChimpanzee (Pan troglodytes) or Denisovan for polarisation
ReferenceGRCh38
Expected D resultD > 0 for all non-African populations vs African (Neanderthal introgression ~1–4%)
PublicationMeyer et al. 2012 Science; Mallick et al. 2016 Nature
Four-taxon setup: P1 = African population (e.g. Yoruba), P2 = non-African population (e.g. European/Asian), P3 = Altai Neanderthal, Outgroup = Chimpanzee. If Neanderthal introgressed into non-Africans after the out-of-Africa migration, we expect more ABBA (P2–P3 share derived allele) than BABA → D > 0.

Step 1 — Download & Prepare Data

Bash — Download SGDP VCF + Neanderthal and merge
######################################################################
## STEP 1: Download the SGDP modern human + Neanderthal VCF and prepare
## for introgression analysis.
## We need a single multi-sample VCF containing:
##   1. Modern human diversity panel (SGDP)
##   2. Altai Neanderthal (as an individual in the VCF)
##   3. Outgroup alleles (chimpanzee or Denisovan)
######################################################################

mkdir -p 0_raw 1_filtered 2_popmap 3_dsuite 4_admixtools 5_fd 6_plots
module load bcftools/1.17
module load vcftools/0.1.16

## --- Download the SGDP callset (chr1 for testing; ~1.5 GB per chromosome) ---
## The full SGDP VCF is available from EBI at PRJEB9586
## Here we use the publicly available chr-by-chr VCFs
wget -P 0_raw/ \
  ftp://ngs.sanger.ac.uk/production/sgdp/phase3-shapeit2-mvncall-integrated-hg38-PASS-chrs/\
  chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget -P 0_raw/ \
  ftp://ngs.sanger.ac.uk/production/sgdp/phase3-shapeit2-mvncall-integrated-hg38-PASS-chrs/\
  chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi

## --- Pre-filter: biallelic SNPs, MAF > 1%, < 5% missing ---
## --type snps           : SNPs only (no indels, no MNPs)
## --min-alleles 2       : minimum 2 alleles (= at least 1 alt allele)
## --max-alleles 2       : maximum 2 alleles (biallelic only)
## --min-af 0.01:minor   : minor allele frequency >= 1%
## --include             : bcftools expression to filter missing data
bcftools view \
    --type snps \
    --min-alleles 2 --max-alleles 2 \
    --min-af 0.01:minor \
    0_raw/chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
  bcftools view \
    --include 'F_MISSING < 0.05' \
    --output-type z \
    --output 1_filtered/sgdp_chr1_filtered.vcf.gz

## Index the filtered VCF
bcftools index --tbi 1_filtered/sgdp_chr1_filtered.vcf.gz

## --- Check sample count ---
echo "Number of samples: $(bcftools query -l 1_filtered/sgdp_chr1_filtered.vcf.gz | wc -l)"
echo "Number of SNPs:    $(bcftools view --no-header 1_filtered/sgdp_chr1_filtered.vcf.gz | wc -l)"

## --- Create the population mapping file for Dsuite ---
## Format: two columns (tab-separated):
##   Column 1: sample ID (must match VCF header exactly)
##   Column 2: population name (e.g. Yoruba, French, Han, Neanderthal)
## Neanderthal sample should be assigned the special label "Outgroup" or its own group
## Use SGDP metadata from:
## https://www.simonsfoundation.org/simons-genome-diversity-project/

## Example: extract sample list and assign populations from metadata
bcftools query -l 1_filtered/sgdp_chr1_filtered.vcf.gz > 2_popmap/all_samples.txt

## You need to manually create popmap.txt based on SGDP metadata
## Format example (first 5 lines):
## S_Yoruba-1    Yoruba
## S_Yoruba-2    Yoruba
## S_French-1    French
## S_Han-1       Han
## AltaiNea      Neanderthal   ← the Neanderthal sample
echo "Please create 2_popmap/popmap.txt manually based on SGDP metadata."
echo "Then create 2_popmap/tree.nwk: ((Yoruba,(French,Han)),Neanderthal,Chimp);"
Download:
Data preparation decoded
👥Population map fileThe popmap.txt file tells Dsuite which sample belongs to which population. Each line has: sample_id [TAB] population_name. The outgroup population must be listed last in the guide tree, and typically one population is designated as Outgroup for polarising alleles as ancestral/derived.
🌳Guide treeA Newick-format tree specifying the known or expected relationships between populations. Dsuite uses this tree to determine which trio combinations of populations to test (P1, P2, P3). The topology should reflect the species/population phylogeny — incorrect topology can obscure real introgression signals.

Step 2 — Dsuite: All-Population D-statistics

Dsuite (Martin et al. 2021) efficiently computes D-statistics for all possible trio combinations across a multi-population dataset. It uses the guide tree to organise results.

Bash — Dsuite Dtrios and Fbranch analysis
######################################################################
## STEP 2: Run Dsuite to compute D-statistics for ALL trio combinations
## This is the most efficient way to screen a multi-population dataset
## for gene flow. Dsuite tests every possible (P1, P2, P3) combination
## and produces a matrix of D-statistics and Z-scores.
######################################################################

## Dsuite installation: https://github.com/millanek/Dsuite
## Compile with: make
## Add to PATH: export PATH="/path/to/Dsuite/Build:$PATH"

VCF="1_filtered/sgdp_chr1_filtered.vcf.gz"
POPMAP="2_popmap/popmap.txt"
TREE="2_popmap/tree.nwk"

## --- Dsuite Dtrios: test all (P1, P2, P3) trios ---
## Dtrios     : the subcommand for all-trio D-statistic testing
## -t TREE    : guide tree — tells Dsuite which population is outgroup
##              and organises trios according to tree topology
## -n sgdp    : output file prefix (output saved in 3_dsuite/)
## The last argument is the VCF file
## NOTE: Dsuite automatically identifies the Outgroup population from the tree
Dsuite Dtrios \
    -t ${TREE} \
    -n 3_dsuite/sgdp \
    ${VCF} \
    ${POPMAP}

## Output files created:
##   3_dsuite/sgdp_BBAA.txt    ← main D-statistic results
##   3_dsuite/sgdp_Dmin.txt    ← minimum D among trios with the same P3
##   3_dsuite/sgdp_tree.txt    ← results ordered by guide tree

## --- View results: columns are P1, P2, P3, Dstatistic, Z-score, p-value ---
## Z-score > 3 is the conventional significance threshold (equivalent to p < 0.001)
echo "=== Top introgression signals (|Z-score| > 3) ==="
awk 'NR>1 && ($5 > 3 || $5 < -3) {print}' 3_dsuite/sgdp_BBAA.txt | \
    sort -k5 -rn | head -20

## --- Dsuite Fbranch: estimate which branch received gene flow ---
## Fbranch uses the BBAA file + the guide tree to determine which
## specific branch in the tree is the source/recipient of introgression
## This is more informative than raw D-statistics
Dsuite Fbranch \
    ${TREE} \
    3_dsuite/sgdp_BBAA.txt \
    > 3_dsuite/sgdp_Fbranch.txt

echo "Fbranch results saved in 3_dsuite/sgdp_Fbranch.txt"
Download:
Dsuite commands decoded
📈BBAA.txt vs Dmin.txtBBAA.txt contains results for the trio arrangement where the BBAA count (P2 and P3 share derived allele) exceeds BABA and ABBA — this is typically the topology-consistent arrangement. Dmin.txt reports the minimum D value among all arrangements of a trio — a conservative estimator that reduces false positives from tree misspecification.
🌳FbranchFbranch post-processes Dtrios output to assign introgression signals to specific branches in the tree. Instead of asking “did P2 and P3 exchange genes?”, it asks “which branch of the tree carries the excess gene flow?”. This is essential for distinguishing direct introgression from shared ancient introgression.
🎯Z-score thresholdZ-score > 3 (≈ p < 0.001) is the conventional threshold for significant D-statistics. Z-scores are computed by jackknifing across blocks of the genome. More blocks = more stable Z-scores. For whole-genome data, use blocks of 5 Mb. For single-chromosome tests, results may be less stable.
👥
Expected Neanderthal result: All non-African vs African trios with Neanderthal as P3 should show D > 0 with Z-score > 3. African populations should show D ≈ 0. This asymmetry — Neanderthal alleles in non-Africans but not Africans — is the original 2010 evidence for Neanderthal introgression published by Green et al.

Step 3 — ADMIXTOOLS2: f3 and f4 Statistics

R — ADMIXTOOLS2 f3 (admixture test) and f4 (D-statistic)
######################################################################
## STEP 3: Use ADMIXTOOLS2 to compute f3 and f4 statistics
##
## f3(A; B, C) = outgroup f3 = measures shared drift between B and C
##   relative to outgroup A.
##   Negative f3 = A is an ADMIXTURE of B and C (admixture test)
##
## f4(A, B; C, D) = equivalent to Patterson's D statistic
##   = tests if (A,B) or (A,C) share more derived alleles with D
##   Non-zero f4 = evidence of gene flow between (B,C) or (A,D)
######################################################################

## Install ADMIXTOOLS2 in R:
## devtools::install_github("uqrmaie1/admixtools")

library(admixtools)    # ADMIXTOOLS2 R package
library(tidyverse)

## --- Step 1: Build the SNP data object from VCF ---
## This pre-computes allele frequency data and stores it in a compact format
## for fast f-statistic calculations
prefix <- "1_filtered/sgdp_chr1_filtered"   # VCF prefix (without .vcf.gz)
outdir <- "4_admixtools/"

## Convert VCF to ADMIXTOOLS2 format (creates .geno, .snp, .ind files)
## vcf2eigenstrat() is a helper that calls plink internally
## This step may take 10-30 minutes for large datasets
vcf2eigenstrat(
    vcf = paste0(prefix, ".vcf.gz"),
    out = paste0(outdir, "sgdp_eigenstrat")
)

## --- Step 2: Load the data ---
## read_plink() reads PLINK BED format; read_eigenstrat() for EIGENSTRAT format
## popinfo: data frame mapping individual IDs to populations (from popmap.txt)
popinfo <- read.table("2_popmap/popmap.txt",
                      col.names = c("ind", "pop"), sep="\t")

data <- read_eigenstrat(paste0(outdir, "sgdp_eigenstrat"), pops=popinfo)

## --- Step 3: Compute f4 statistics (= Patterson's D) ---
## f4(Yoruba, NonAfrican; Neanderthal, Chimp)
## If non-African populations introgressed with Neanderthal:
##   f4 should be NEGATIVE (excess shared derived alleles between NonAfrican and Neanderthal)
## Define the four populations to test:
f4_result <- f4(
    data,
    pop1 = "Yoruba",          # P1: African outgroup (no Neanderthal introgression)
    pop2 = "French",          # P2: non-African test population
    pop3 = "Neanderthal",     # P3: putative source of introgression
    pop4 = "Chimp",           # P4: outgroup for polarising derived alleles
    boot = TRUE               # bootstrap across SNP blocks for Z-scores
)

## --- View results ---
cat("f4 estimate:", f4_result$est, "\n")
cat("Z-score:    ", f4_result$z, "\n")
cat("p-value:    ", 2*pnorm(-abs(f4_result$z)), "\n")
## Significant f4 < 0 = evidence of Neanderthal gene flow into French

## --- Step 4: Compute f3 admixture test ---
## f3(Test; Source1, Source2)
## Negative f3 = Test population is admixed between Source1 and Source2
f3_result <- f3(
    data,
    pop1 = "French",          # Test: the potentially admixed population
    pop2 = "Yoruba",          # Source 1: African-like source
    pop3 = "Neanderthal",     # Source 2: Neanderthal-like source
    boot = TRUE
)

cat("\nf3 estimate:", f3_result$est, "\n")
cat("Z-score:    ", f3_result$z, "\n")
## Significantly negative f3 = French is admixed between African + Neanderthal sources

## --- Step 5: Scan f4 across all population pairs ---
## This tests all combinations of populations systematically
f4_all <- f4blockjack(
    data,
    pop1 = "Yoruba",
    pop4 = "Chimp",
    pops = unique(popinfo$pop)   # test all populations as P2 and P3
)

write.csv(f4_all, paste0(outdir, "f4_all_populations.csv"), row.names=FALSE)
cat("\nf4 results for all population pairs saved.\n")
Download:
f3 and f4 statistics decoded
📊f4 vs D-statisticf4 and Patterson’s D are mathematically equivalent — they both measure the same ABBA-BABA asymmetry. ADMIXTOOLS2 f4 uses a slightly different normalisation and reports an absolute estimate of shared genetic drift rather than a ratio. Both give the same sign and significance.
Negative f3 = admixtureThe f3 admixture test is one of the most powerful tests in population genomics. If f3(A; B, C) is significantly negative, population A cannot be explained by a simple tree relationship with B and C — it must be an admixture of B-like and C-like sources. This is a necessary condition for admixture, not sufficient — false negatives occur when the admixture is very ancient.
🎯boot = TRUEEnables block jackknife resampling to estimate standard errors and Z-scores. Blocks of ~5 Mb are removed one at a time and the statistic is re-computed. This captures the correlation structure of the genome (LD) in the standard error estimate. Without bootstrapping, standard errors are underestimated.

Step 4 — fd-Statistic: Local Introgression Map

The fd-statistic (Martin et al. 2015) is a windowed version of D that estimates the proportion of the genome introgressed in each window. This creates a map of where Neanderthal sequence survives in modern human genomes.

Python — fd-statistic using Simon Martin's genomics_general scripts
######################################################################
## STEP 4: Calculate windowed fd-statistic (Martin et al. 2015)
## fd estimates the fraction of introgression in each genomic window
## This gives a fine-scale map of which windows are enriched for
## Neanderthal (or any donor population) ancestry
##
## Uses Simon Martin's genomics_general Python scripts:
## https://github.com/simonhmartin/genomics_general
######################################################################

## Clone the genomics_general repository
git clone https://github.com/simonhmartin/genomics_general.git
## Add to Python path:
export PYTHONPATH="${PWD}/genomics_general:${PYTHONPATH}"

## --- Step 1: Convert VCF to GENO format (required by genomics_general) ---
## parseVCF.py: converts VCF to a simpler per-site genotype format
## --minQual 30 : minimum site QUAL score of 30
## -i vcf       : input format is VCF
## -o alleles   : output format = one allele per column per sample
python genomics_general/VCF_processing/parseVCF.py \
    --minQual 30 \
    -i 1_filtered/sgdp_chr1_filtered.vcf.gz \
    | bgzip > 5_fd/sgdp_chr1.geno.gz

## Index the geno file
bgzip -r 5_fd/sgdp_chr1.geno.gz

## --- Step 2: Run windowed fd-statistic ---
## ABBABABAwindows.py: the core script for ABBA-BABA window analysis
## -g        : input GENO file
## -o        : output file
## -f        : output format (csv)
## -w 50000  : window size = 50,000 bp (50 kb)
## -m 50     : minimum number of SNPs per window (skip windows with <50 SNPs)
## -s 25000  : step size = 25,000 bp (50% overlap between windows)
## -p P1 Yoruba : assign samples to population P1 (African reference)
## -p P2 French : assign samples to P2 (non-African test population)
## -p P3 Neanderthal : P3 = putative introgression donor
## -p O  Chimp : outgroup for ancestral allele polarisation
## --fd      : also compute fd in addition to D (fd is the preferred windowed metric)
python genomics_general/ABBABABAwindows.py \
    -g 5_fd/sgdp_chr1.geno.gz \
    -o 5_fd/fd_windows_chr1.csv \
    -f csv \
    -w 50000 \
    -m 50 \
    -s 25000 \
    -p P1 Yoruba \
    -p P2 French \
    -p P3 Neanderthal \
    -p O  Chimp \
    --popsFile 2_popmap/popmap.txt \
    --T 8 \
    --fd

## --- View output columns ---
## scaffold, start, end, mid, sites, D, fd
head 5_fd/fd_windows_chr1.csv
echo "Windows computed: $(wc -l < 5_fd/fd_windows_chr1.csv)"

## --- Identify top 5% windows with most Neanderthal introgression ---
## fd > 0 and D > 0 = excess of Neanderthal-like alleles in this window
awk -F',' 'NR>1 && $7 > 0' 5_fd/fd_windows_chr1.csv | \
    sort -t',' -k7 -rn | head -20
Download:
fd-statistic decoded
📊fd vs DThe genome-wide D-statistic is a single average — it cannot tell you where in the genome introgression occurred. fd is a windowed estimator that also corrects the denominator to be the maximum possible D under complete introgression, making it interpretable as a proportion (0 = no introgression, 1 = 100% introgressed). Use fd for mapping introgression, D for testing significance.
🎯-m 50 minimum SNPsWindows with very few SNPs (especially on chromosome arms near centromeres and telomeres) produce unreliable fd estimates. Setting a minimum of 50 SNPs per window removes these. Windows with fewer SNPs show artificially extreme fd values due to statistical noise.
🧪
Known Neanderthal deserts and islands: The X chromosome and regions around centromeres show near-zero Neanderthal ancestry (Neanderthal “deserts”), while some autosomal regions show elevated fd > 0.1 — these are the “Neanderthal islands” that survived selection. Your fd map for chr1 should match published maps from Sankararaman et al. 2014 or Vernot & Akey 2014.

Step 5 — ADMIXTOOLS2: qpGraph Population Graph Fitting

R — Fit a population graph with migration edges using qpGraph
######################################################################
## STEP 5: Fit a population graph (admixture graph) using ADMIXTOOLS2
## qpGraph fits a model that explains the observed f-statistics
## with a tree + migration edges (admixture proportions).
## This is how we quantify HOW MUCH Neanderthal ancestry is in
## non-African populations.
######################################################################

library(admixtools)
library(tidyverse)

## Load the data (same as Step 3)
popinfo <- read.table("2_popmap/popmap.txt", col.names=c("ind","pop"), sep="\t")
data <- read_eigenstrat("4_admixtools/sgdp_eigenstrat", pops=popinfo)

## --- Define the population graph model ---
## The graph is specified as a list of edges:
##   "edge from to" = tree branch from parent to child
##   "admix child parent1 parent2 proportion" = admixture edge
## We test this model:
##   Simple tree: ((Yoruba, (French, Han)), Neanderthal, Chimp)
##   + one admixture edge: Neanderthal → ancestor of French+Han
##   The proportion of this edge = estimated Neanderthal ancestry fraction

## Define graph topology as R object
## Each list element is: c(child_node, parent_node_1, parent_node_2, proportion)
graph <- qpgraph(
    data = data,
    ## Define graph: pops + admixture edges
    ## Format: edges defined as parent → child in tibble
    edges = tribble(
      ~from,          ~to,            ~weight,
      "root",         "ancestral_AA", NA,        # root splits into African ancestor
      "root",         "Nea",          NA,        #   and Neanderthal lineage
      "ancestral_AA", "Yoruba",       NA,        # Yoruba = unadmixed African
      "ancestral_AA", "nonAfrAnc",    NA,        # ancestor of non-Africans
      "nonAfrAnc",    "French",       NA,        # French
      "nonAfrAnc",    "Han",          NA,        # Han Chinese
      "Nea",          "AltaiNea",     NA         # Altai Neanderthal
    ),
    ## Admixture edge: Neanderthal introgression into non-African ancestor
    ## prop = proportion to estimate (between 0 and 1)
    admix_edges = tribble(
      ~from,   ~to,         ~prop,
      "Nea",   "nonAfrAnc", 0.02   # starting value: 2% Neanderthal ancestry
    ),
    outgroup = "Chimp",    # all f-statistics computed relative to this outgroup
    verbose = TRUE
)

## --- Fit the graph ---
## This optimises the branch lengths and admixture proportions to minimise
## the discrepancy between observed and expected f-statistics
fit <- fit_graph(graph, data)

## --- View results ---
cat("Neanderthal admixture proportion:", fit$opt_edge_weights$w, "\n")
cat("Graph fit score (worst Z-score):  ", fit$score, "\n")
## A good fit has |Z-score| < 3 for all f-statistics (fit$score close to 0)

## --- Save results ---
saveRDS(fit, "4_admixtools/qpgraph_fit.rds")
write.csv(fit$f3, "4_admixtools/fitted_f3.csv")
Download:
qpGraph decoded
📈fit$scoreThe fit score is the maximum absolute Z-score among all f-statistics. A score < 3 means no f-statistic differs significantly from the model prediction — the graph fits the data well. A score > 3 means the graph is missing at least one admixture event. Add more migration edges until the score drops below 3.
🎯Admixture proportion estimateThe fitted weight on the Neanderthal → non-African admixture edge is the estimated fraction of Neanderthal ancestry (~0.02 = 2%). This should match the ~1–4% range from published studies. If your estimate is far outside this range, check your outgroup specification and ancestry designation.

Step 6 — Visualise Results

R — Heatmap of D-statistics + fd genome map
######################################################################
## STEP 6: Create publication-quality figures:
##   Figure 1: Heatmap of D-statistics across all population pairs
##   Figure 2: Genome-wide fd map (Neanderthal ancestry per window)
######################################################################

library(ggplot2)
library(tidyverse)
library(viridis)      # colour-blind friendly colour scales

## ---- Figure 1: D-statistic heatmap ---------------------------------
## Load Dsuite results
dstats <- read.table("3_dsuite/sgdp_BBAA.txt", header=TRUE)
## Columns: P1, P2, P3, Dstatistic, Z-score, p-value, f_D, ...

## Filter: only trios with Neanderthal as P3
nea_trios <- dstats %>%
    filter(P3 == "Neanderthal") %>%
    mutate(
        significant = abs(Zscore) > 3,   # TRUE if Z-score exceeds threshold
        D_clipped = pmax(-0.5, pmin(0.5, Dstatistic))  # clip extreme values for display
    )

## Heatmap: P1 vs P2, coloured by D-statistic
## Red = positive D (P2 shares more derived alleles with Neanderthal than P1)
ggplot(nea_trios, aes(x=P1, y=P2, fill=D_clipped)) +
    geom_tile() +
    geom_text(aes(label=ifelse(significant, "*", "")), size=3) +
    scale_fill_gradient2(
        low="blue", mid="white", high="red", midpoint=0,
        limits=c(-0.5, 0.5), name="D-statistic"
    ) +
    labs(
        title="D-statistics (P3 = Altai Neanderthal, Outgroup = Chimp)",
        subtitle="* = |Z-score| > 3 (significant)",
        x="P1 population", y="P2 population"
    ) +
    theme_bw(base_size=11) +
    theme(axis.text.x = element_text(angle=45, hjust=1, size=8))

ggsave("6_plots/dstats_heatmap.png", width=12, height=10, dpi=150)

## ---- Figure 2: fd genome map for chr1 ------------------------------
fd_data <- read.csv("5_fd/fd_windows_chr1.csv") %>%
    filter(!is.na(fd), fd >= 0, fd <= 1)   # keep valid fd values (0 to 1)

ggplot(fd_data, aes(x=mid/1e6, y=fd)) +
    ## Points: each window coloured by D-statistic
    geom_point(aes(color=D), size=0.5, alpha=0.7) +
    ## Smoothed trend line to show regional patterns
    geom_smooth(color="black", se=FALSE, method="loess", span=0.05, linewidth=0.8) +
    scale_color_viridis_c(option="plasma", name="D-statistic") +
    labs(
        x="Position on Chr1 (Mb)",
        y="fd (Neanderthal introgression fraction)",
        title="Local Neanderthal ancestry across Chromosome 1 (French vs Yoruba)"
    ) +
    theme_bw(base_size=12)

ggsave("6_plots/fd_chr1_map.png", width=14, height=5, dpi=150)
Download:
Visualisation decoded
🌶geom_tile() heatmapA tile plot where each cell is coloured by the D-statistic between a P1–P2 pair (with Neanderthal as P3). Asterisks mark statistically significant values. Red cells = P2 shares more alleles with Neanderthal than P1 (evidence of introgression specifically in P2). The diagonal is empty (can’t compare a population to itself).
🌈scale_color_viridis_c(option="plasma")Viridis colour scales are perceptually uniform and accessible to colour-blind readers. The "plasma" option goes from dark purple (low) to yellow (high) and is ideal for continuous data. Always use perceptually uniform colour scales for genomic maps.