Whole Genome Duplication & Polyploidy Analysis

WGD concepts explained · Ks plots · MCScanX synteny blocks · WGDI collinearity · dot plot interpretation · subgenome phasing · fractionation bias

~150 min Advanced Bash / Python / R
Arabidopsis thaliana TAIR10 & Brassica napus Darmor-bzh — NCBI/EnsemblPlants
Byte

WGD Concepts Explained

Whole Genome Duplication (WGD) is an event where an organism's entire genome is copied — either by autopolyploidy (doubling within one species) or allopolyploidy (hybridisation between two species followed by chromosome doubling). WGD is ubiquitous in plants: virtually every flowering plant lineage has experienced at least one ancient WGD event. Understanding WGD history is fundamental for interpreting genome size, gene family expansion, and duplicate gene evolution.

Byte
Three flavours of polyploidy to distinguish:
  • Autopolyploidy: Whole-genome doubling within one species (e.g. potato S. tuberosum 4x; banana Musa 3x). Both copies originate from the same ancestral genome. After millions of years of diploidisation they become invisible cytologically.
  • Allopolyploidy: Hybridisation of two distinct species whose merged genome then doubles (e.g. bread wheat = A+B+D subgenomes; Arabidopsis suecica = A. thaliana + A. arenosa; Brassica napus = B. rapa + B. oleracea). The two parental subgenomes can often be distinguished by GC content differences, transposon landscape, and gene expression dominance.
  • Paleopolyploidy: Ancient WGD that has been mostly diploidised — the organism looks cytologically diploid but its genome contains remnant duplicated synteny blocks revealing the ancient event. Example: Arabidopsis thaliana (2n=10) underwent a WGD ~35 Mya (α event) and still carries paralogous chromosomal blocks.
Evidence typeToolWhat it showsLimitation
Ks distributionwgd, KaKs_CalculatorPeak in synonymous substitution histogram = synchronised duplication burst = WGDKs saturates at >2.0; very old WGDs invisible
Synteny / collinearityMCScanX, WGDI, SynMap2Duplicated chromosomal blocks within the same genome = remnant WGD segmentsGene loss (fractionation) erodes old blocks over time
Dot plotWGDI, MCScan, jcvi2x2 synteny pattern = 1 WGD; 4x4 = 2 WGDs stackedMeaningful only with chromosome-level assemblies
Chromosome countCytologyMultiples of base chromosome number x suggest polyploidy historyDiploidisation obscures history; not bioinformatic
Subgenome assignmentSubPhaser, ASTRAL, homeolog SNPsWhich chromosomes belong to which ancestral parental subgenomeRequires sufficient divergence between subgenomes

Ks as a Molecular Clock for WGD Dating

Ks (synonymous substitution rate) counts the number of silent substitutions per synonymous site between two homologous coding sequences. Because synonymous mutations do not change the amino acid sequence, they accumulate at a roughly neutral, constant rate — acting as a molecular clock.

Ks valueApproximate age (plants)What to expect
0.0 – 0.05Very recent (<3 Mya)Tandem duplicates; very recent polyploidisation events
0.05 – 0.33–20 MyaRecent WGDs: e.g. legume WGD (~13 Mya), Brassiceae WGD (~7 Mya)
0.3 – 1.020–60 MyaOlder WGDs: Arabidopsis α (~35 Mya), grass ρ (~70 Mya)
1.0 – 2.060–100 MyaKs approaching saturation; core eudicot γ triplication (~100 Mya)
>2.0>100 MyaKs saturated — all synonymous sites may be multiple-hit; signal unreliable

Use the live Ks calculator below to convert any Ks peak value to a WGD age estimate. Edit the Ks value and the substitution rate to see how the age changes — this makes the molecular clock formula completely transparent:

Reading a Ks plot
📊A sharp peak at a specific Ks value= WGD signal. All duplicated gene pairs from a WGD event accumulated the same amount of synonymous substitutions since the duplication → they form a peak in the histogram.
📊A shoulder or secondary peak= evidence of an older WGD layered underneath a newer one (e.g. maize has peaks at Ks~0.05 for the recent maize WGD and Ks~1.5 for the ancient grass WGD).
📊A broad flat distribution, no peak= background of tandem/segmental duplicates; no WGD. Use synteny blocks to confirm.
Converting Ks to timeTime (Mya) = Ks ÷ (2 × r), where r = synonymous substitution rate per site per year. For plants: r ≈ 1.5–7 × 10−9 (highly variable). Always calibrate using paralog pairs with a known fossil/biogeographic date in your own lineage.

Step 1 — All-vs-All BLAST: Finding Paralogous Gene Pairs

Bash — Self-BLAST + longest-isoform filtering
######################################################################
## STEP 1: All-vs-all protein BLAST to identify all paralog pairs
## Self-BLAST: we BLAST every protein against every other protein
## in the same genome, then filter for likely paralogs.
##
## Dataset: Arabidopsis thaliana TAIR10
##   Known WGDs: alpha (~35 Mya), beta (~100 Mya), gamma triplication (~140 Mya)
##   27,655 protein-coding genes; 5 chromosomes; 135 Mb
######################################################################

## Download Arabidopsis TAIR10 protein FASTA
wget "https://www.arabidopsis.org/download_files/Proteins/TAIR10_protein_lists/TAIR10_pep_20101214_updated" \
    -O Athaliana_TAIR10.faa

## Step 1a: Keep only the longest isoform per gene
## Using all isoforms creates false duplicate pairs (same gene, different splice forms)
python3 - <<'PYEOF'
from Bio import SeqIO
seqs = {}
for rec in SeqIO.parse("Athaliana_TAIR10.faa", "fasta"):
    gene = rec.id.split(".")[0]   # gene ID = everything before first "."
    if gene not in seqs or len(rec.seq) > len(seqs[gene].seq):
        seqs[gene] = rec
with open("Athaliana_longest.faa", "w") as f:
    SeqIO.write(seqs.values(), f, "fasta")
print(f"Wrote {len(seqs)} proteins (longest isoform per gene)")
PYEOF

## Step 1b: Build BLAST database
module load blast+/2.14.0
makeblastdb -in Athaliana_longest.faa -dbtype prot -out Athaliana_db

## Step 1c: Run self-BLAST (query == database)
## -max_target_seqs 20: keep top 20 hits per query
##   WGD analysis does not need all distant homologs
## -perc_identity 40: 40% is permissive enough to catch ancient WGD pairs
##   (very old paralogs from Ks~1.5 WGDs have ~50-60% identity at protein level)
blastp \
    -query   Athaliana_longest.faa \
    -db      Athaliana_db \
    -evalue  1e-5 \
    -num_threads 16 \
    -max_target_seqs 20 \
    -outfmt  "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
    -perc_identity 40 \
    -out     Athaliana_selfblast.txt
echo "Raw BLAST hits: $(wc -l < Athaliana_selfblast.txt)"

## Step 1d: Remove self-hits and keep best hit per gene pair
## awk '$1 != $2': remove lines where query == subject (self-hit)
## sort + awk '!seen[]': keep only the top-scoring hit per pair
awk '$1 != $2' Athaliana_selfblast.txt | \
    sort -k1,1 -k2,2 -k12,12rn | \
    awk '!seen[$1"_"$2]++' > Athaliana_best_pairs.txt
echo "Unique paralog pairs: $(wc -l < Athaliana_best_pairs.txt)"
Download:
🎯
Always use the longest isoform! Alternative transcripts from the same gene appear as apparent "paralogs" in BLAST output. Using all isoforms inflates the Ks distribution with Ks~0 pairs (identical protein = Ks=0) and creates a false spike at the origin. This is the single most common error in Ks analysis pipelines.

Step 2 — Ks Calculation & WGD Peak Identification

Python + R — wgd suite Ks pipeline + Gaussian mixture model peak fitting
######################################################################
## STEP 2: Calculate Ks values and identify WGD peaks
## Tool: wgd Python suite (https://github.com/arzwa/wgd)
## wgd handles: MCL clustering -> CDS alignment -> Ks calculation
## -> mixture model fitting to identify how many WGD peaks exist
## Install: pip install wgd
######################################################################

## Also download CDS sequences (required for synonymous rate calculation)
wget "https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_blastsets/TAIR10_cds_20101214_updated" \
    -O Athaliana_CDS.fna

## Step A: MCL gene family clustering based on BLAST homology
## --inflation 1.5: MCL inflation parameter (higher = more, smaller families)
## --eval 1e-5:     BLAST E-value cutoff
wgd dmd Athaliana_longest.faa \
    --inflation 1.5 \
    --eval 1e-5 \
    --aln_tool mafft \
    -o wgd_dmd/

## Step B: Calculate pairwise Ks within each gene family
## wgd ksd aligns each gene pair at the codon level and uses PAML yn00
## (or codeml) to estimate Ka and Ks from the codon alignment
wgd ksd wgd_dmd/*.mcl.tsv Athaliana_longest.faa \
    --cds Athaliana_CDS.fna \
    -o wgd_ks/

## Step C: Gaussian mixture model to decompose WGD peaks
## -n 1:5 = try 1 to 5 Gaussian components; AIC/BIC selects best
wgd mix wgd_ks/ks.tsv \
    --n_range 1 5 \
    -o wgd_mix/

## Outputs:
##   wgd_ks/ks.tsv        : all pairwise Ks values
##   wgd_mix/*.pdf        : Ks plot with fitted mixture components
##   wgd_mix/*.tsv        : fitted component means (= WGD peak Ks values)

## ============================================================
## R: Custom Ks distribution plot with WGD annotations
## ============================================================
cat <<'REOF' > plot_ks_distribution.R
library(ggplot2)
library(mclust)      # Gaussian mixture model fitting
library(tidyverse)

## Load Ks values
ks_raw <- read.delim("wgd_ks/ks.tsv", header=TRUE)

## Quality filters:
## - Ks >= 0.05: exclude very recent/tandem duplicates (Ks near 0 is unreliable)
## - Ks <= 3.0:  exclude saturated pairs (multiple hits at synonymous sites)
## - Ka/Ks <= 1: exclude positively selected pairs (Ks underestimated under positive selection)
ks_filt <- ks_raw %>%
    filter(Ks >= 0.05, Ks <= 3.0) %>%
    filter(is.na(Ka) | (Ka/Ks) <= 1.0)

cat("Paralog pairs after filtering:", nrow(ks_filt), "\n")

## Fit Gaussian mixture to log(Ks)
## Working in log space because Ks distributions are right-skewed
set.seed(42)
bic_fit  <- mclustBIC(log(ks_filt$Ks), G=1:5, modelNames="V")
best_fit <- Mclust(log(ks_filt$Ks), x=bic_fit)
cat("Best number of components:", best_fit$G, "\n")

## Report WGD peaks
peak_ks <- sort(exp(best_fit$parameters$mean))
props    <- best_fit$parameters$pro[order(best_fit$parameters$mean)]
cat("\nKs peaks (WGD candidates):\n")
for (i in seq_along(peak_ks)) {
    ## Convert Ks to approximate age using plant synonymous rate
    ## r = 1.5e-8 substitutions/site/year (Arabidopsis-calibrated)
    r_plant   <- 1.5e-8
    age_Mya   <- peak_ks[i] / (2 * r_plant) / 1e6
    cat(sprintf("  Peak %d: Ks=%.3f  (~%.0f Mya)  %.1f%% of pairs\n",
                i, peak_ks[i], age_Mya, props[i]*100))
}

## Build density data for plot
dens <- density(ks_filt$Ks, bw=0.05, from=0, to=3)

## Custom plot
ggplot(ks_filt, aes(x=Ks)) +
    geom_histogram(aes(y=after_stat(density)),
                   bins=120, fill="steelblue", color=NA, alpha=0.65) +
    geom_line(data=data.frame(x=dens$x, y=dens$y),
              aes(x=x, y=y), color="navy", linewidth=1) +
    ## Annotate WGD peaks (Arabidopsis-specific)
    geom_vline(xintercept=0.15, color="firebrick", linetype=2, linewidth=0.9) +
    geom_vline(xintercept=0.90, color="darkorange", linetype=2, linewidth=0.9) +
    annotate("text", x=0.20, y=2.8,
             label="alpha WGD\n~35 Mya", color="firebrick", size=3.5, hjust=0) +
    annotate("text", x=0.95, y=1.5,
             label="beta WGD\n~100 Mya", color="darkorange", size=3.5, hjust=0) +
    scale_x_continuous(limits=c(0,3), breaks=seq(0,3,0.5)) +
    labs(
        title    = "Ks distribution of Arabidopsis thaliana paralogs",
        subtitle = "Peaks at Ks peaks indicate ancient whole-genome duplication events",
        x        = "Ks (synonymous substitutions per synonymous site)",
        y        = "Density"
    ) +
    theme_bw(base_size=13)
ggsave("ks_distribution.png", width=9, height=5.5, dpi=150)
cat("Saved: ks_distribution.png\n")
REOF

Rscript plot_ks_distribution.R
Download:
📌
Tandem duplicates inflate the Ks signal! Tandem gene duplicates (adjacent in the genome, often very recent) create a broad low-Ks background that can mask or inflate the WGD peak. The solution: filter for syntenic paralogs (genes in collinear blocks from Step 3) before reporting a WGD. The Ks peak from syntenic paralogs is much sharper and more convincing than from all paralogs.

Step 3 — MCScanX Synteny Block Detection

Bash — MCScanX collinearity detection + visualisation setup
######################################################################
## STEP 3: MCScanX - detect synteny blocks (collinear genomic regions)
## A synteny block = chromosomal region where multiple gene pairs
## appear in the same relative order on two chromosomal segments.
## Within-genome collinear blocks = WGD remnants.
##
## Download: https://github.com/wyp1125/MCScanX
## Also works with MCScan (Python version: python-jcvi / jcvi package)
######################################################################

## Step 3a: Prepare GFF input for MCScanX
## MCScanX needs: chr(no spaces) TAB gene_id TAB start TAB end
## Arabidopsis GFF3 from TAIR:
wget "https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff" \
    -O TAIR10_genes.gff

## Convert GFF3 to MCScanX format
## Extract gene features, strip "Chr" prefix from chromosome names
awk '$3 == "gene" {
    chr = $1; sub(/^Chr/, "", chr)
    match($9, /ID=([^;]+)/, a)
    print chr "\t" a[1] "\t" $4 "\t" $5
}' TAIR10_genes.gff > Athaliana.gff

echo "Gene entries in GFF: $(wc -l < Athaliana.gff)"

## Step 3b: Rename BLAST file (MCScanX requires prefix.blast)
## Use the best-hit BLAST file from Step 1
cp Athaliana_best_pairs.txt Athaliana.blast

## Verify chromosome names match between GFF and BLAST
echo "Chromosomes in GFF:"
cut -f1 Athaliana.gff | sort -u
echo "Gene IDs in BLAST (first column, 5 examples):"
cut -f1 Athaliana.blast | head -5

## Step 3c: Run MCScanX
## -a : attempt to add unanchored genes into existing synteny blocks
## -e 1e-5 : BLAST E-value threshold
## -u 25   : max number of gaps (uncollinear genes) between anchor pairs
##   Increase -u for older WGDs (more fractionation = more gaps expected)
## -s 5    : minimum 5 collinear gene pairs per synteny block
##   Decrease to -s 3 for fragmented/highly fractionated genomes
MCScanX Athaliana \
    -a \
    -e 1e-5 \
    -u 25 \
    -s 5
## Output: Athaliana.collinearity (the main result file)

## Step 3d: Count synteny blocks
echo "Total synteny blocks:"
grep -c "^## Alignment" Athaliana.collinearity

## Step 3e: Extract only within-genome (intragenomic) blocks
## These are the WGD-derived duplicated segments
## Intragenomic = both chromosomes in the pair are from the same species
grep "^## Alignment" Athaliana.collinearity | \
    awk -F'[& ]' '{if ($5 == $6 || ($5 != $6 && $5 != "")) print}' | \
    head -20

## Step 3f: Build synteny visualisation input (for WGDI or Circos)
## MCScanX downstream tools directory has scripts for:
##   - Collinearity plot
##   - Venn diagram of shared synteny
##   - Chromosome number/gene count statistics
python3 MCScanX/downstream_analyses/detect_collinear_tandem_dispersed_proximal.py \
    -g Athaliana.gff \
    -b Athaliana.blast \
    -c Athaliana.collinearity \
    -o Athaliana_duplication_mode.txt
echo "Duplication mode classification:"
head -5 Athaliana_duplication_mode.txt
Download:
Synteny block parameters decoded
📌-u (gap tolerance)The maximum number of non-collinear genes allowed between two consecutive anchor pairs within a synteny block. Higher -u = more permissive; finds longer blocks but includes more noisy segments. For recent WGDs (Ks<0.3): use -u 10-15. For ancient WGDs (Ks>1.0) with heavy fractionation: use -u 25-50. Too high a value will create spuriously long blocks spanning unrelated chromosomal regions.
📌-s (minimum block size)Minimum number of collinear anchor pairs required for a block to be reported. -s 5 is the MCSCAN standard; it requires at least 5 homologous gene pairs in the correct order. Use -s 3 for very fragmented genomes (contig-level assemblies). Use -s 8-10 for high-confidence blocks only.

Step 4 — WGDI Collinearity Analysis & Dot Plots

Python — WGDI pipeline: collinearity + dot plot + Ks on syntenic pairs
######################################################################
## STEP 4: WGDI - a comprehensive WGD analysis suite
## WGDI (https://github.com/SunPengChuan/wgdi) integrates:
##   - Collinearity detection (improved MCScan algorithm)
##   - Syntenic Ks distribution (only syntenic paralog pairs)
##   - Dot plots (chromosome-scale synteny visualization)
##   - Subgenome analysis
##   - Collinearity depth (how many duplicates per region)
## Install: pip install wgdi
######################################################################

## Step 4a: Prepare WGDI input files
## WGDI needs a .conf file specifying all input paths and parameters.
## Required input files:
##   1. Genome annotation GFF (same as MCScanX)
##   2. Protein FASTA (same as Step 1)
##   3. BLAST output (same as Step 1)

## Create WGDI configuration file
cat > wgdi_collinearity.conf <<'CONF'
[dotplot]
blast = Athaliana_best_pairs.txt
gff   = Athaliana.gff
lens  = Athaliana_lens.txt        ; chromosome lengths file
genome1_name = Ath               ; label for x-axis
genome2_name = Ath               ; same genome for intragenomic
multiple = 1                     ; show top 1 synteny block per region
score = 100                      ; minimum BLAST bit score
e_value = 1e-5
repeat_number = 20               ; top N BLAST hits to consider
position = order                 ; plot by gene order (not Mbp position)
ancestor_left = none
ancestor_top = none
markersize = 0.5                 ; dot size in the plot
figsize = 10,10                  ; figure dimensions (inches)
savefile = Ath_vs_Ath_dotplot.pdf

[collinearity]
blast = Athaliana_best_pairs.txt
gff   = Athaliana.gff
lens  = Athaliana_lens.txt
multiple = 1
score = 100
e_value = 1e-5
repeat_number = 20
mg = 40,40                       ; maximum gap between anchor pairs (= MCScanX -u)
pvalue = 0.2
savefile = Athaliana_collinearity.txt

[ks]
cds_file = Athaliana_CDS.fna
pep_file = Athaliana_longest.faa
align_software = mafft            ; or muscle
pairs_file = Athaliana_collinearity.txt  ; use syntenic pairs only!
ks_file = Athaliana_syntenic_ks.txt
CONF

## Create chromosome lengths file (tab-separated: chr_name TAB gene_count)
## Count genes per chromosome from GFF
awk '{print $1}' Athaliana.gff | sort | uniq -c | \
    awk '{print $2 "\t" $1}' > Athaliana_lens.txt
echo "Chromosome lengths (gene counts):"
cat Athaliana_lens.txt

## Step 4b: Run WGDI dot plot
wgdi -d wgdi_collinearity.conf
## Output: Ath_vs_Ath_dotplot.pdf

## Step 4c: Run WGDI collinearity detection
wgdi -c wgdi_collinearity.conf
## Output: Athaliana_collinearity.txt

## Step 4d: Calculate Ks for SYNTENIC pairs only
## This is the key difference from all-paralog Ks: using only syntenic
## (collinear) pairs removes tandem duplicates and produces a much
## cleaner Ks distribution with sharper WGD peaks
wgdi -ks wgdi_collinearity.conf
## Output: Athaliana_syntenic_ks.txt

echo "Syntenic Ks pairs:"
wc -l Athaliana_syntenic_ks.txt
Download:

How to Interpret the Dot Plot

A dot plot between a genome and itself (intragenomic) reveals WGD history through its synteny pattern:

What you see in the dot plotWhat it means
Diagonal line (self-comparison identity line)Every chromosome aligned to itself — always present and not informative
Off-diagonal dot lines: each chromosome has 1 paralogous partner1 WGD event. Genome is diploidised paleotetraploid (2x2 pattern).
Each chromosome has 2 paralogous partners2 WGD events stacked. Diploidised paleohexaploid (e.g. core eudicot gamma triplication gives each region 3 copies → 2 off-diagonal partners).
Dots faint or broken across chromosomeFractionation (gene loss after WGD). Normal for ancient WGDs — not all synteny blocks are intact.
No off-diagonal syntenyNo WGD detected at the Ks range visible in the genome

Step 5 — Subgenome Phasing (Allopolyploids)

Python + Bash — SubPhaser for subgenome assignment in allopolyploids
######################################################################
## STEP 5: Subgenome phasing for allopolyploids
## For allopolyploids (e.g. Brassica napus = B. rapa AC + B. oleracea C),
## we want to assign each chromosome to its parental subgenome.
## This matters for:
##   - Understanding gene expression dominance (one subgenome is expressed more)
##   - Tracing which genes came from which ancestor
##   - Understanding fractionation bias (one subgenome retains more genes)
##
## Tool: SubPhaser (https://github.com/zhangrengang/SubPhaser)
##   Uses k-mer frequency differences between chromosomes to identify
##   which chromosomes share the same ancestral origin (same subgenome)
##
## Dataset: Brassica napus (canola) - known allotetraploid
##   An genome: 19 chromosomes from B. rapa ancestor
##   Cn genome: 19 chromosomes from B. oleracea ancestor
######################################################################

## Download B. napus genome (Darmor-bzh reference)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/686/985/GCA_000686985.1_Bnapus_v4.1/\
GCA_000686985.1_Bnapus_v4.1_genomic.fna.gz \
    -O Bnapus_Darmor.fna.gz
gunzip Bnapus_Darmor.fna.gz

## Install SubPhaser
pip install subphaser

## Run SubPhaser
## -i : genome FASTA
## -c : expected number of subgenomes (2 for allotetraploid)
## SubPhaser uses k-mer (k=15 by default) frequency profiles to cluster
## chromosomes into subgenome groups based on their k-mer similarity
## Chromosomes from the same ancestor have similar repetitive element profiles
subphaser \
    -i Bnapus_Darmor.fna \
    -c 2 \
    -k 15 \
    -o subphaser_output/
## Output: subphaser_output/subgenome_assignments.tsv
## Lists each chromosome with its assigned subgenome (A or C)

## ================================================================
## Alternative approach: Use diploid ancestor genomes to assign subgenomes
## ================================================================
## If you have sequenced the diploid ancestors (B. rapa and B. oleracea),
## you can directly assign B. napus chromosomes by synteny to each ancestor

## Step A: BLAST B. napus proteins against B. rapa (A genome ancestor)
makeblastdb -in Brapa_proteins.faa -dbtype prot -out Brapa_db
blastp -query Bnapus_proteins.faa -db Brapa_db \
    -evalue 1e-5 -max_target_seqs 1 -outfmt 6 \
    -out Bnapus_vs_Brapa.blast -num_threads 16

## Step B: BLAST B. napus proteins against B. oleracea (C genome ancestor)
makeblastdb -in Boleracea_proteins.faa -dbtype prot -out Boleracea_db
blastp -query Bnapus_proteins.faa -db Boleracea_db \
    -evalue 1e-5 -max_target_seqs 1 -outfmt 6 \
    -out Bnapus_vs_Boleracea.blast -num_threads 16

## Step C: For each B. napus gene, assign to whichever ancestor has better BLAST hit
python3 - <<'PYEOF'
import pandas as pd

cols = ["qseqid","sseqid","pident","length","mm","go","qs","qe","ss","se","evalue","bitscore"]
rapa = pd.read_csv("Bnapus_vs_Brapa.blast",    sep="\t", names=cols)
olr  = pd.read_csv("Bnapus_vs_Boleracea.blast", sep="\t", names=cols)

rapa_best = rapa.groupby("qseqid")["bitscore"].max().rename("score_A")
olr_best  = olr.groupby("qseqid")["bitscore"].max().rename("score_C")

df = pd.concat([rapa_best, olr_best], axis=1).fillna(0)
df["subgenome"] = df.apply(lambda r: "An" if r.score_A >= r.score_C else "Cn", axis=1)
df.to_csv("subgenome_gene_assignments.tsv", sep="\t")

# Count genes per subgenome
print(df["subgenome"].value_counts())
PYEOF
Download:

Step 6 — Fractionation Bias

After WGD, one copy of each duplicated gene pair is progressively lost through a process called gene fractionation. In allopolyploids, this loss is often biased toward one subgenome (the subgenome dominant in gene expression tends to retain more genes). Detecting fractionation bias is important evidence that your genome is allopolyploid (not autopolyploid, where fractionation is usually symmetrical).

R — Fractionation bias: gene retention per synteny block per subgenome
######################################################################
## STEP 6: Fractionation bias analysis
## For each synteny block (from Step 3/4), count how many genes were
## retained vs lost in each subgenome.
## Biased fractionation: one subgenome consistently retains more genes
## = evidence of allopolyploidy + subgenome dominance.
######################################################################

library(tidyverse)
library(ggplot2)

## Load synteny blocks from WGDI/MCScanX collinearity output
## And gene-to-subgenome assignments from Step 5
collinearity <- read.delim("Athaliana_collinearity.txt",
                            comment.char="#", header=FALSE,
                            col.names=c("block","i","gene1","gene2","score"))
subgenomes   <- read.delim("subgenome_gene_assignments.tsv")

## For each synteny block, calculate:
##   - Number of genes retained in subgenome 1 (An)
##   - Number of genes retained in subgenome 2 (Cn)
##   - Retention ratio (An / total)
block_stats <- collinearity %>%
    left_join(subgenomes %>% select(qseqid, subgenome),
              by=c("gene1"="qseqid")) %>%
    group_by(block, subgenome) %>%
    summarise(n_genes = n(), .groups="drop") %>%
    pivot_wider(names_from=subgenome, values_from=n_genes, values_fill=0) %>%
    mutate(
        total    = An + Cn,
        An_frac  = An / total,
        bias     = case_when(
            An_frac > 0.6 ~ "An dominant",
            An_frac < 0.4 ~ "Cn dominant",
            TRUE           ~ "Balanced"
        )
    )

cat("=== Fractionation bias summary ===\n")
print(table(block_stats$bias))

## Plot fractionation bias per synteny block
ggplot(block_stats, aes(x=An_frac, fill=bias)) +
    geom_histogram(bins=30, color="white") +
    geom_vline(xintercept=0.5, linetype=2, color="grey40") +
    scale_fill_manual(values=c(
        "An dominant"="steelblue", "Cn dominant"="tomato", "Balanced"="grey70")) +
    labs(
        title    = "Fractionation bias across synteny blocks",
        subtitle = "Shift from 0.5 indicates one subgenome retains more genes (biased fractionation)",
        x        = "Proportion of retained genes from An subgenome",
        y        = "Number of synteny blocks",
        fill     = NULL
    ) +
    theme_bw(base_size=13)
ggsave("fractionation_bias.png", width=8, height=5, dpi=150)

## Statistical test: is fractionation biased overall?
## One-sample t-test: is the mean An_frac significantly different from 0.5?
t_result <- t.test(block_stats$An_frac, mu=0.5)
cat("\n=== Fractionation bias test (H0: mean fraction = 0.5) ===\n")
cat("Mean An fraction:", round(mean(block_stats$An_frac), 3), "\n")
cat("p-value:", formatC(t_result$p.value, format="e", digits=3), "\n")
if (t_result$p.value < 0.05) {
    cat("CONCLUSION: Significant fractionation bias detected (p < 0.05)\n")
    cat("            Consistent with allopolyploidy and subgenome dominance.\n")
} else {
    cat("CONCLUSION: No significant fractionation bias (p >= 0.05)\n")
    cat("            Consistent with autopolyploidy or balanced diploidisation.\n")
}
Download:

Step 7 — Reporting WGD Evidence

A convincing WGD manuscript section requires convergent evidence from multiple independent lines. The table below is the standard checklist reviewers expect:

EvidenceWhat to reportTool used
Ks peak (all paralogs)Peak Ks value ± SD; number of pairs; background-subtracted peak heightwgd, KaKs_Calculator
Ks peak (syntenic only)Same as above but for collinear gene pairs only; much sharper signalwgd + MCScanX/WGDI collinearity
Estimated WGD ageX Mya ± CI; calibration rate used; calibration reference taxonKs ÷ 2r formula
Synteny blocksNumber of intragenomic collinear blocks; N genes in blocks; median block sizeMCScanX or WGDI
Dot plotFigure showing 1:1 or 1:2 chromosome correspondence; off-diagonal linesWGDI or MCScan
Collinearity depthWhat fraction of the genome is covered by duplicated synteny blocksMCScanX downstream analysis
Fractionation patternSymmetric (autopolyploidy) or asymmetric (allopolyploidy); p-value for bias testR fractionation analysis
Phylogenetic placementIs the WGD shared with sister lineages? (requires multi-species comparison)ASTRAL / species tree with WGD
🔎
The 3-evidence rule for WGD publication: Reviewers generally require at least three independent lines of evidence for a new WGD claim: (1) a clear Ks peak in syntenic paralogs, (2) a dot plot showing duplicated chromosomal blocks, and (3) genome-wide collinearity statistics. A Ks peak alone is insufficient because tandem duplicates and gene conversion can create spurious peaks. Synteny blocks are the strongest evidence because they cannot be explained by tandem duplication.
R — Multi-species Ks comparison to date and share a WGD event
######################################################################
## STEP 7: Multi-species Ks comparison
## To confirm a WGD is shared by a lineage (not species-specific),
## compare:
##   1. Paralog Ks within each species (peak = species WGD timing)
##   2. Ortholog Ks between species (= speciation time; calibration)
## If all species in a clade show the WGD peak at the same Ks value
## relative to their pairwise ortholog Ks, the WGD is shared.
######################################################################

library(ggplot2)
library(tidyverse)

## Load Ks values for several Brassicaceae species
## (paralog Ks from wgd; ortholog Ks from CodeML on one-to-one orthologs)
ks_ath   <- read.delim("Athaliana_syntenic_ks.txt")    %>% mutate(species="A. thaliana", type="Paralog")
ks_aly   <- read.delim("Alyrata_syntenic_ks.txt")      %>% mutate(species="A. lyrata",   type="Paralog")
ks_brap  <- read.delim("Brapa_syntenic_ks.txt")        %>% mutate(species="B. rapa",      type="Paralog")
## Ortholog Ks between A. thaliana and A. lyrata (= speciation time calibration)
ks_orth  <- read.delim("Ath_vs_Aly_ortholog_ks.txt")  %>% mutate(species="Ath-Aly orthologs", type="Ortholog")

all_ks <- bind_rows(ks_ath, ks_aly, ks_brap, ks_orth) %>%
    filter(Ks >= 0.05, Ks <= 3.0)

## Plot: overlaid Ks distributions for all species + ortholog calibration
ggplot(all_ks, aes(x=Ks, fill=species, color=species)) +
    geom_density(aes(linetype=type), alpha=0.3, linewidth=1) +
    ## Ortholog distribution peak = speciation time (calibration)
    scale_linetype_manual(values=c("Paralog"="solid", "Ortholog"="dashed")) +
    scale_x_continuous(limits=c(0, 3), breaks=seq(0, 3, 0.5)) +
    labs(
        title    = "Multi-species Ks comparison: Brassicaceae WGDs",
        subtitle = "Dashed line = A. thaliana - A. lyrata ortholog Ks (speciation calibration)",
        x        = "Ks",
        y        = "Density",
        fill     = "Species",
        color    = "Species",
        linetype = "Pair type"
    ) +
    theme_bw(base_size=13) +
    theme(legend.position="right")
ggsave("multispecies_ks_comparison.png", width=10, height=5.5, dpi=150)

## Interpretation guide:
## - If all three species show the SAME Ks peak (after normalising by ortholog Ks),
##   the WGD is shared by the common ancestor of all three species.
## - If only B. rapa shows an additional peak at lower Ks,
##   B. rapa has an additional, more recent WGD (the Brassiceae WGD).
cat("Multi-species Ks comparison saved: multispecies_ks_comparison.png\n")
Download: