WGD concepts explained · Ks plots · MCScanX synteny blocks · WGDI collinearity · dot plot interpretation · subgenome phasing · fractionation bias
Whole Genome Duplication (WGD) is an event where an organism's entire genome is copied — either by autopolyploidy (doubling within one species) or allopolyploidy (hybridisation between two species followed by chromosome doubling). WGD is ubiquitous in plants: virtually every flowering plant lineage has experienced at least one ancient WGD event. Understanding WGD history is fundamental for interpreting genome size, gene family expansion, and duplicate gene evolution.
| Evidence type | Tool | What it shows | Limitation |
|---|---|---|---|
| Ks distribution | wgd, KaKs_Calculator | Peak in synonymous substitution histogram = synchronised duplication burst = WGD | Ks saturates at >2.0; very old WGDs invisible |
| Synteny / collinearity | MCScanX, WGDI, SynMap2 | Duplicated chromosomal blocks within the same genome = remnant WGD segments | Gene loss (fractionation) erodes old blocks over time |
| Dot plot | WGDI, MCScan, jcvi | 2x2 synteny pattern = 1 WGD; 4x4 = 2 WGDs stacked | Meaningful only with chromosome-level assemblies |
| Chromosome count | Cytology | Multiples of base chromosome number x suggest polyploidy history | Diploidisation obscures history; not bioinformatic |
| Subgenome assignment | SubPhaser, ASTRAL, homeolog SNPs | Which chromosomes belong to which ancestral parental subgenome | Requires sufficient divergence between subgenomes |
Ks (synonymous substitution rate) counts the number of silent substitutions per synonymous site between two homologous coding sequences. Because synonymous mutations do not change the amino acid sequence, they accumulate at a roughly neutral, constant rate — acting as a molecular clock.
| Ks value | Approximate age (plants) | What to expect |
|---|---|---|
| 0.0 – 0.05 | Very recent (<3 Mya) | Tandem duplicates; very recent polyploidisation events |
| 0.05 – 0.3 | 3–20 Mya | Recent WGDs: e.g. legume WGD (~13 Mya), Brassiceae WGD (~7 Mya) |
| 0.3 – 1.0 | 20–60 Mya | Older WGDs: Arabidopsis α (~35 Mya), grass ρ (~70 Mya) |
| 1.0 – 2.0 | 60–100 Mya | Ks approaching saturation; core eudicot γ triplication (~100 Mya) |
| >2.0 | >100 Mya | Ks saturated — all synonymous sites may be multiple-hit; signal unreliable |
Use the live Ks calculator below to convert any Ks peak value to a WGD age estimate. Edit the Ks value and the substitution rate to see how the age changes — this makes the molecular clock formula completely transparent:
| 📊A sharp peak at a specific Ks value | = WGD signal. All duplicated gene pairs from a WGD event accumulated the same amount of synonymous substitutions since the duplication → they form a peak in the histogram. |
| 📊A shoulder or secondary peak | = evidence of an older WGD layered underneath a newer one (e.g. maize has peaks at Ks~0.05 for the recent maize WGD and Ks~1.5 for the ancient grass WGD). |
| 📊A broad flat distribution, no peak | = background of tandem/segmental duplicates; no WGD. Use synteny blocks to confirm. |
| ⏰Converting Ks to time | Time (Mya) = Ks ÷ (2 × r), where r = synonymous substitution rate per site per year. For plants: r ≈ 1.5–7 × 10−9 (highly variable). Always calibrate using paralog pairs with a known fossil/biogeographic date in your own lineage. |
######################################################################
## STEP 1: All-vs-all protein BLAST to identify all paralog pairs
## Self-BLAST: we BLAST every protein against every other protein
## in the same genome, then filter for likely paralogs.
##
## Dataset: Arabidopsis thaliana TAIR10
## Known WGDs: alpha (~35 Mya), beta (~100 Mya), gamma triplication (~140 Mya)
## 27,655 protein-coding genes; 5 chromosomes; 135 Mb
######################################################################
## Download Arabidopsis TAIR10 protein FASTA
wget "https://www.arabidopsis.org/download_files/Proteins/TAIR10_protein_lists/TAIR10_pep_20101214_updated" \
-O Athaliana_TAIR10.faa
## Step 1a: Keep only the longest isoform per gene
## Using all isoforms creates false duplicate pairs (same gene, different splice forms)
python3 - <<'PYEOF'
from Bio import SeqIO
seqs = {}
for rec in SeqIO.parse("Athaliana_TAIR10.faa", "fasta"):
gene = rec.id.split(".")[0] # gene ID = everything before first "."
if gene not in seqs or len(rec.seq) > len(seqs[gene].seq):
seqs[gene] = rec
with open("Athaliana_longest.faa", "w") as f:
SeqIO.write(seqs.values(), f, "fasta")
print(f"Wrote {len(seqs)} proteins (longest isoform per gene)")
PYEOF
## Step 1b: Build BLAST database
module load blast+/2.14.0
makeblastdb -in Athaliana_longest.faa -dbtype prot -out Athaliana_db
## Step 1c: Run self-BLAST (query == database)
## -max_target_seqs 20: keep top 20 hits per query
## WGD analysis does not need all distant homologs
## -perc_identity 40: 40% is permissive enough to catch ancient WGD pairs
## (very old paralogs from Ks~1.5 WGDs have ~50-60% identity at protein level)
blastp \
-query Athaliana_longest.faa \
-db Athaliana_db \
-evalue 1e-5 \
-num_threads 16 \
-max_target_seqs 20 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
-perc_identity 40 \
-out Athaliana_selfblast.txt
echo "Raw BLAST hits: $(wc -l < Athaliana_selfblast.txt)"
## Step 1d: Remove self-hits and keep best hit per gene pair
## awk '$1 != $2': remove lines where query == subject (self-hit)
## sort + awk '!seen[]': keep only the top-scoring hit per pair
awk '$1 != $2' Athaliana_selfblast.txt | \
sort -k1,1 -k2,2 -k12,12rn | \
awk '!seen[$1"_"$2]++' > Athaliana_best_pairs.txt
echo "Unique paralog pairs: $(wc -l < Athaliana_best_pairs.txt)"
######################################################################
## STEP 2: Calculate Ks values and identify WGD peaks
## Tool: wgd Python suite (https://github.com/arzwa/wgd)
## wgd handles: MCL clustering -> CDS alignment -> Ks calculation
## -> mixture model fitting to identify how many WGD peaks exist
## Install: pip install wgd
######################################################################
## Also download CDS sequences (required for synonymous rate calculation)
wget "https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_blastsets/TAIR10_cds_20101214_updated" \
-O Athaliana_CDS.fna
## Step A: MCL gene family clustering based on BLAST homology
## --inflation 1.5: MCL inflation parameter (higher = more, smaller families)
## --eval 1e-5: BLAST E-value cutoff
wgd dmd Athaliana_longest.faa \
--inflation 1.5 \
--eval 1e-5 \
--aln_tool mafft \
-o wgd_dmd/
## Step B: Calculate pairwise Ks within each gene family
## wgd ksd aligns each gene pair at the codon level and uses PAML yn00
## (or codeml) to estimate Ka and Ks from the codon alignment
wgd ksd wgd_dmd/*.mcl.tsv Athaliana_longest.faa \
--cds Athaliana_CDS.fna \
-o wgd_ks/
## Step C: Gaussian mixture model to decompose WGD peaks
## -n 1:5 = try 1 to 5 Gaussian components; AIC/BIC selects best
wgd mix wgd_ks/ks.tsv \
--n_range 1 5 \
-o wgd_mix/
## Outputs:
## wgd_ks/ks.tsv : all pairwise Ks values
## wgd_mix/*.pdf : Ks plot with fitted mixture components
## wgd_mix/*.tsv : fitted component means (= WGD peak Ks values)
## ============================================================
## R: Custom Ks distribution plot with WGD annotations
## ============================================================
cat <<'REOF' > plot_ks_distribution.R
library(ggplot2)
library(mclust) # Gaussian mixture model fitting
library(tidyverse)
## Load Ks values
ks_raw <- read.delim("wgd_ks/ks.tsv", header=TRUE)
## Quality filters:
## - Ks >= 0.05: exclude very recent/tandem duplicates (Ks near 0 is unreliable)
## - Ks <= 3.0: exclude saturated pairs (multiple hits at synonymous sites)
## - Ka/Ks <= 1: exclude positively selected pairs (Ks underestimated under positive selection)
ks_filt <- ks_raw %>%
filter(Ks >= 0.05, Ks <= 3.0) %>%
filter(is.na(Ka) | (Ka/Ks) <= 1.0)
cat("Paralog pairs after filtering:", nrow(ks_filt), "\n")
## Fit Gaussian mixture to log(Ks)
## Working in log space because Ks distributions are right-skewed
set.seed(42)
bic_fit <- mclustBIC(log(ks_filt$Ks), G=1:5, modelNames="V")
best_fit <- Mclust(log(ks_filt$Ks), x=bic_fit)
cat("Best number of components:", best_fit$G, "\n")
## Report WGD peaks
peak_ks <- sort(exp(best_fit$parameters$mean))
props <- best_fit$parameters$pro[order(best_fit$parameters$mean)]
cat("\nKs peaks (WGD candidates):\n")
for (i in seq_along(peak_ks)) {
## Convert Ks to approximate age using plant synonymous rate
## r = 1.5e-8 substitutions/site/year (Arabidopsis-calibrated)
r_plant <- 1.5e-8
age_Mya <- peak_ks[i] / (2 * r_plant) / 1e6
cat(sprintf(" Peak %d: Ks=%.3f (~%.0f Mya) %.1f%% of pairs\n",
i, peak_ks[i], age_Mya, props[i]*100))
}
## Build density data for plot
dens <- density(ks_filt$Ks, bw=0.05, from=0, to=3)
## Custom plot
ggplot(ks_filt, aes(x=Ks)) +
geom_histogram(aes(y=after_stat(density)),
bins=120, fill="steelblue", color=NA, alpha=0.65) +
geom_line(data=data.frame(x=dens$x, y=dens$y),
aes(x=x, y=y), color="navy", linewidth=1) +
## Annotate WGD peaks (Arabidopsis-specific)
geom_vline(xintercept=0.15, color="firebrick", linetype=2, linewidth=0.9) +
geom_vline(xintercept=0.90, color="darkorange", linetype=2, linewidth=0.9) +
annotate("text", x=0.20, y=2.8,
label="alpha WGD\n~35 Mya", color="firebrick", size=3.5, hjust=0) +
annotate("text", x=0.95, y=1.5,
label="beta WGD\n~100 Mya", color="darkorange", size=3.5, hjust=0) +
scale_x_continuous(limits=c(0,3), breaks=seq(0,3,0.5)) +
labs(
title = "Ks distribution of Arabidopsis thaliana paralogs",
subtitle = "Peaks at Ks peaks indicate ancient whole-genome duplication events",
x = "Ks (synonymous substitutions per synonymous site)",
y = "Density"
) +
theme_bw(base_size=13)
ggsave("ks_distribution.png", width=9, height=5.5, dpi=150)
cat("Saved: ks_distribution.png\n")
REOF
Rscript plot_ks_distribution.R
######################################################################
## STEP 3: MCScanX - detect synteny blocks (collinear genomic regions)
## A synteny block = chromosomal region where multiple gene pairs
## appear in the same relative order on two chromosomal segments.
## Within-genome collinear blocks = WGD remnants.
##
## Download: https://github.com/wyp1125/MCScanX
## Also works with MCScan (Python version: python-jcvi / jcvi package)
######################################################################
## Step 3a: Prepare GFF input for MCScanX
## MCScanX needs: chr(no spaces) TAB gene_id TAB start TAB end
## Arabidopsis GFF3 from TAIR:
wget "https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff" \
-O TAIR10_genes.gff
## Convert GFF3 to MCScanX format
## Extract gene features, strip "Chr" prefix from chromosome names
awk '$3 == "gene" {
chr = $1; sub(/^Chr/, "", chr)
match($9, /ID=([^;]+)/, a)
print chr "\t" a[1] "\t" $4 "\t" $5
}' TAIR10_genes.gff > Athaliana.gff
echo "Gene entries in GFF: $(wc -l < Athaliana.gff)"
## Step 3b: Rename BLAST file (MCScanX requires prefix.blast)
## Use the best-hit BLAST file from Step 1
cp Athaliana_best_pairs.txt Athaliana.blast
## Verify chromosome names match between GFF and BLAST
echo "Chromosomes in GFF:"
cut -f1 Athaliana.gff | sort -u
echo "Gene IDs in BLAST (first column, 5 examples):"
cut -f1 Athaliana.blast | head -5
## Step 3c: Run MCScanX
## -a : attempt to add unanchored genes into existing synteny blocks
## -e 1e-5 : BLAST E-value threshold
## -u 25 : max number of gaps (uncollinear genes) between anchor pairs
## Increase -u for older WGDs (more fractionation = more gaps expected)
## -s 5 : minimum 5 collinear gene pairs per synteny block
## Decrease to -s 3 for fragmented/highly fractionated genomes
MCScanX Athaliana \
-a \
-e 1e-5 \
-u 25 \
-s 5
## Output: Athaliana.collinearity (the main result file)
## Step 3d: Count synteny blocks
echo "Total synteny blocks:"
grep -c "^## Alignment" Athaliana.collinearity
## Step 3e: Extract only within-genome (intragenomic) blocks
## These are the WGD-derived duplicated segments
## Intragenomic = both chromosomes in the pair are from the same species
grep "^## Alignment" Athaliana.collinearity | \
awk -F'[& ]' '{if ($5 == $6 || ($5 != $6 && $5 != "")) print}' | \
head -20
## Step 3f: Build synteny visualisation input (for WGDI or Circos)
## MCScanX downstream tools directory has scripts for:
## - Collinearity plot
## - Venn diagram of shared synteny
## - Chromosome number/gene count statistics
python3 MCScanX/downstream_analyses/detect_collinear_tandem_dispersed_proximal.py \
-g Athaliana.gff \
-b Athaliana.blast \
-c Athaliana.collinearity \
-o Athaliana_duplication_mode.txt
echo "Duplication mode classification:"
head -5 Athaliana_duplication_mode.txt
| 📌-u (gap tolerance) | The maximum number of non-collinear genes allowed between two consecutive anchor pairs within a synteny block. Higher -u = more permissive; finds longer blocks but includes more noisy segments. For recent WGDs (Ks<0.3): use -u 10-15. For ancient WGDs (Ks>1.0) with heavy fractionation: use -u 25-50. Too high a value will create spuriously long blocks spanning unrelated chromosomal regions. |
| 📌-s (minimum block size) | Minimum number of collinear anchor pairs required for a block to be reported. -s 5 is the MCSCAN standard; it requires at least 5 homologous gene pairs in the correct order. Use -s 3 for very fragmented genomes (contig-level assemblies). Use -s 8-10 for high-confidence blocks only. |
######################################################################
## STEP 4: WGDI - a comprehensive WGD analysis suite
## WGDI (https://github.com/SunPengChuan/wgdi) integrates:
## - Collinearity detection (improved MCScan algorithm)
## - Syntenic Ks distribution (only syntenic paralog pairs)
## - Dot plots (chromosome-scale synteny visualization)
## - Subgenome analysis
## - Collinearity depth (how many duplicates per region)
## Install: pip install wgdi
######################################################################
## Step 4a: Prepare WGDI input files
## WGDI needs a .conf file specifying all input paths and parameters.
## Required input files:
## 1. Genome annotation GFF (same as MCScanX)
## 2. Protein FASTA (same as Step 1)
## 3. BLAST output (same as Step 1)
## Create WGDI configuration file
cat > wgdi_collinearity.conf <<'CONF'
[dotplot]
blast = Athaliana_best_pairs.txt
gff = Athaliana.gff
lens = Athaliana_lens.txt ; chromosome lengths file
genome1_name = Ath ; label for x-axis
genome2_name = Ath ; same genome for intragenomic
multiple = 1 ; show top 1 synteny block per region
score = 100 ; minimum BLAST bit score
e_value = 1e-5
repeat_number = 20 ; top N BLAST hits to consider
position = order ; plot by gene order (not Mbp position)
ancestor_left = none
ancestor_top = none
markersize = 0.5 ; dot size in the plot
figsize = 10,10 ; figure dimensions (inches)
savefile = Ath_vs_Ath_dotplot.pdf
[collinearity]
blast = Athaliana_best_pairs.txt
gff = Athaliana.gff
lens = Athaliana_lens.txt
multiple = 1
score = 100
e_value = 1e-5
repeat_number = 20
mg = 40,40 ; maximum gap between anchor pairs (= MCScanX -u)
pvalue = 0.2
savefile = Athaliana_collinearity.txt
[ks]
cds_file = Athaliana_CDS.fna
pep_file = Athaliana_longest.faa
align_software = mafft ; or muscle
pairs_file = Athaliana_collinearity.txt ; use syntenic pairs only!
ks_file = Athaliana_syntenic_ks.txt
CONF
## Create chromosome lengths file (tab-separated: chr_name TAB gene_count)
## Count genes per chromosome from GFF
awk '{print $1}' Athaliana.gff | sort | uniq -c | \
awk '{print $2 "\t" $1}' > Athaliana_lens.txt
echo "Chromosome lengths (gene counts):"
cat Athaliana_lens.txt
## Step 4b: Run WGDI dot plot
wgdi -d wgdi_collinearity.conf
## Output: Ath_vs_Ath_dotplot.pdf
## Step 4c: Run WGDI collinearity detection
wgdi -c wgdi_collinearity.conf
## Output: Athaliana_collinearity.txt
## Step 4d: Calculate Ks for SYNTENIC pairs only
## This is the key difference from all-paralog Ks: using only syntenic
## (collinear) pairs removes tandem duplicates and produces a much
## cleaner Ks distribution with sharper WGD peaks
wgdi -ks wgdi_collinearity.conf
## Output: Athaliana_syntenic_ks.txt
echo "Syntenic Ks pairs:"
wc -l Athaliana_syntenic_ks.txt
A dot plot between a genome and itself (intragenomic) reveals WGD history through its synteny pattern:
| What you see in the dot plot | What it means |
|---|---|
| Diagonal line (self-comparison identity line) | Every chromosome aligned to itself — always present and not informative |
| Off-diagonal dot lines: each chromosome has 1 paralogous partner | 1 WGD event. Genome is diploidised paleotetraploid (2x2 pattern). |
| Each chromosome has 2 paralogous partners | 2 WGD events stacked. Diploidised paleohexaploid (e.g. core eudicot gamma triplication gives each region 3 copies → 2 off-diagonal partners). |
| Dots faint or broken across chromosome | Fractionation (gene loss after WGD). Normal for ancient WGDs — not all synteny blocks are intact. |
| No off-diagonal synteny | No WGD detected at the Ks range visible in the genome |
######################################################################
## STEP 5: Subgenome phasing for allopolyploids
## For allopolyploids (e.g. Brassica napus = B. rapa AC + B. oleracea C),
## we want to assign each chromosome to its parental subgenome.
## This matters for:
## - Understanding gene expression dominance (one subgenome is expressed more)
## - Tracing which genes came from which ancestor
## - Understanding fractionation bias (one subgenome retains more genes)
##
## Tool: SubPhaser (https://github.com/zhangrengang/SubPhaser)
## Uses k-mer frequency differences between chromosomes to identify
## which chromosomes share the same ancestral origin (same subgenome)
##
## Dataset: Brassica napus (canola) - known allotetraploid
## An genome: 19 chromosomes from B. rapa ancestor
## Cn genome: 19 chromosomes from B. oleracea ancestor
######################################################################
## Download B. napus genome (Darmor-bzh reference)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/686/985/GCA_000686985.1_Bnapus_v4.1/\
GCA_000686985.1_Bnapus_v4.1_genomic.fna.gz \
-O Bnapus_Darmor.fna.gz
gunzip Bnapus_Darmor.fna.gz
## Install SubPhaser
pip install subphaser
## Run SubPhaser
## -i : genome FASTA
## -c : expected number of subgenomes (2 for allotetraploid)
## SubPhaser uses k-mer (k=15 by default) frequency profiles to cluster
## chromosomes into subgenome groups based on their k-mer similarity
## Chromosomes from the same ancestor have similar repetitive element profiles
subphaser \
-i Bnapus_Darmor.fna \
-c 2 \
-k 15 \
-o subphaser_output/
## Output: subphaser_output/subgenome_assignments.tsv
## Lists each chromosome with its assigned subgenome (A or C)
## ================================================================
## Alternative approach: Use diploid ancestor genomes to assign subgenomes
## ================================================================
## If you have sequenced the diploid ancestors (B. rapa and B. oleracea),
## you can directly assign B. napus chromosomes by synteny to each ancestor
## Step A: BLAST B. napus proteins against B. rapa (A genome ancestor)
makeblastdb -in Brapa_proteins.faa -dbtype prot -out Brapa_db
blastp -query Bnapus_proteins.faa -db Brapa_db \
-evalue 1e-5 -max_target_seqs 1 -outfmt 6 \
-out Bnapus_vs_Brapa.blast -num_threads 16
## Step B: BLAST B. napus proteins against B. oleracea (C genome ancestor)
makeblastdb -in Boleracea_proteins.faa -dbtype prot -out Boleracea_db
blastp -query Bnapus_proteins.faa -db Boleracea_db \
-evalue 1e-5 -max_target_seqs 1 -outfmt 6 \
-out Bnapus_vs_Boleracea.blast -num_threads 16
## Step C: For each B. napus gene, assign to whichever ancestor has better BLAST hit
python3 - <<'PYEOF'
import pandas as pd
cols = ["qseqid","sseqid","pident","length","mm","go","qs","qe","ss","se","evalue","bitscore"]
rapa = pd.read_csv("Bnapus_vs_Brapa.blast", sep="\t", names=cols)
olr = pd.read_csv("Bnapus_vs_Boleracea.blast", sep="\t", names=cols)
rapa_best = rapa.groupby("qseqid")["bitscore"].max().rename("score_A")
olr_best = olr.groupby("qseqid")["bitscore"].max().rename("score_C")
df = pd.concat([rapa_best, olr_best], axis=1).fillna(0)
df["subgenome"] = df.apply(lambda r: "An" if r.score_A >= r.score_C else "Cn", axis=1)
df.to_csv("subgenome_gene_assignments.tsv", sep="\t")
# Count genes per subgenome
print(df["subgenome"].value_counts())
PYEOF
After WGD, one copy of each duplicated gene pair is progressively lost through a process called gene fractionation. In allopolyploids, this loss is often biased toward one subgenome (the subgenome dominant in gene expression tends to retain more genes). Detecting fractionation bias is important evidence that your genome is allopolyploid (not autopolyploid, where fractionation is usually symmetrical).
######################################################################
## STEP 6: Fractionation bias analysis
## For each synteny block (from Step 3/4), count how many genes were
## retained vs lost in each subgenome.
## Biased fractionation: one subgenome consistently retains more genes
## = evidence of allopolyploidy + subgenome dominance.
######################################################################
library(tidyverse)
library(ggplot2)
## Load synteny blocks from WGDI/MCScanX collinearity output
## And gene-to-subgenome assignments from Step 5
collinearity <- read.delim("Athaliana_collinearity.txt",
comment.char="#", header=FALSE,
col.names=c("block","i","gene1","gene2","score"))
subgenomes <- read.delim("subgenome_gene_assignments.tsv")
## For each synteny block, calculate:
## - Number of genes retained in subgenome 1 (An)
## - Number of genes retained in subgenome 2 (Cn)
## - Retention ratio (An / total)
block_stats <- collinearity %>%
left_join(subgenomes %>% select(qseqid, subgenome),
by=c("gene1"="qseqid")) %>%
group_by(block, subgenome) %>%
summarise(n_genes = n(), .groups="drop") %>%
pivot_wider(names_from=subgenome, values_from=n_genes, values_fill=0) %>%
mutate(
total = An + Cn,
An_frac = An / total,
bias = case_when(
An_frac > 0.6 ~ "An dominant",
An_frac < 0.4 ~ "Cn dominant",
TRUE ~ "Balanced"
)
)
cat("=== Fractionation bias summary ===\n")
print(table(block_stats$bias))
## Plot fractionation bias per synteny block
ggplot(block_stats, aes(x=An_frac, fill=bias)) +
geom_histogram(bins=30, color="white") +
geom_vline(xintercept=0.5, linetype=2, color="grey40") +
scale_fill_manual(values=c(
"An dominant"="steelblue", "Cn dominant"="tomato", "Balanced"="grey70")) +
labs(
title = "Fractionation bias across synteny blocks",
subtitle = "Shift from 0.5 indicates one subgenome retains more genes (biased fractionation)",
x = "Proportion of retained genes from An subgenome",
y = "Number of synteny blocks",
fill = NULL
) +
theme_bw(base_size=13)
ggsave("fractionation_bias.png", width=8, height=5, dpi=150)
## Statistical test: is fractionation biased overall?
## One-sample t-test: is the mean An_frac significantly different from 0.5?
t_result <- t.test(block_stats$An_frac, mu=0.5)
cat("\n=== Fractionation bias test (H0: mean fraction = 0.5) ===\n")
cat("Mean An fraction:", round(mean(block_stats$An_frac), 3), "\n")
cat("p-value:", formatC(t_result$p.value, format="e", digits=3), "\n")
if (t_result$p.value < 0.05) {
cat("CONCLUSION: Significant fractionation bias detected (p < 0.05)\n")
cat(" Consistent with allopolyploidy and subgenome dominance.\n")
} else {
cat("CONCLUSION: No significant fractionation bias (p >= 0.05)\n")
cat(" Consistent with autopolyploidy or balanced diploidisation.\n")
}
A convincing WGD manuscript section requires convergent evidence from multiple independent lines. The table below is the standard checklist reviewers expect:
| Evidence | What to report | Tool used |
|---|---|---|
| Ks peak (all paralogs) | Peak Ks value ± SD; number of pairs; background-subtracted peak height | wgd, KaKs_Calculator |
| Ks peak (syntenic only) | Same as above but for collinear gene pairs only; much sharper signal | wgd + MCScanX/WGDI collinearity |
| Estimated WGD age | X Mya ± CI; calibration rate used; calibration reference taxon | Ks ÷ 2r formula |
| Synteny blocks | Number of intragenomic collinear blocks; N genes in blocks; median block size | MCScanX or WGDI |
| Dot plot | Figure showing 1:1 or 1:2 chromosome correspondence; off-diagonal lines | WGDI or MCScan |
| Collinearity depth | What fraction of the genome is covered by duplicated synteny blocks | MCScanX downstream analysis |
| Fractionation pattern | Symmetric (autopolyploidy) or asymmetric (allopolyploidy); p-value for bias test | R fractionation analysis |
| Phylogenetic placement | Is the WGD shared with sister lineages? (requires multi-species comparison) | ASTRAL / species tree with WGD |
######################################################################
## STEP 7: Multi-species Ks comparison
## To confirm a WGD is shared by a lineage (not species-specific),
## compare:
## 1. Paralog Ks within each species (peak = species WGD timing)
## 2. Ortholog Ks between species (= speciation time; calibration)
## If all species in a clade show the WGD peak at the same Ks value
## relative to their pairwise ortholog Ks, the WGD is shared.
######################################################################
library(ggplot2)
library(tidyverse)
## Load Ks values for several Brassicaceae species
## (paralog Ks from wgd; ortholog Ks from CodeML on one-to-one orthologs)
ks_ath <- read.delim("Athaliana_syntenic_ks.txt") %>% mutate(species="A. thaliana", type="Paralog")
ks_aly <- read.delim("Alyrata_syntenic_ks.txt") %>% mutate(species="A. lyrata", type="Paralog")
ks_brap <- read.delim("Brapa_syntenic_ks.txt") %>% mutate(species="B. rapa", type="Paralog")
## Ortholog Ks between A. thaliana and A. lyrata (= speciation time calibration)
ks_orth <- read.delim("Ath_vs_Aly_ortholog_ks.txt") %>% mutate(species="Ath-Aly orthologs", type="Ortholog")
all_ks <- bind_rows(ks_ath, ks_aly, ks_brap, ks_orth) %>%
filter(Ks >= 0.05, Ks <= 3.0)
## Plot: overlaid Ks distributions for all species + ortholog calibration
ggplot(all_ks, aes(x=Ks, fill=species, color=species)) +
geom_density(aes(linetype=type), alpha=0.3, linewidth=1) +
## Ortholog distribution peak = speciation time (calibration)
scale_linetype_manual(values=c("Paralog"="solid", "Ortholog"="dashed")) +
scale_x_continuous(limits=c(0, 3), breaks=seq(0, 3, 0.5)) +
labs(
title = "Multi-species Ks comparison: Brassicaceae WGDs",
subtitle = "Dashed line = A. thaliana - A. lyrata ortholog Ks (speciation calibration)",
x = "Ks",
y = "Density",
fill = "Species",
color = "Species",
linetype = "Pair type"
) +
theme_bw(base_size=13) +
theme(legend.position="right")
ggsave("multispecies_ks_comparison.png", width=10, height=5.5, dpi=150)
## Interpretation guide:
## - If all three species show the SAME Ks peak (after normalising by ortholog Ks),
## the WGD is shared by the common ancestor of all three species.
## - If only B. rapa shows an additional peak at lower Ks,
## B. rapa has an additional, more recent WGD (the Brassiceae WGD).
cat("Multi-species Ks comparison saved: multispecies_ks_comparison.png\n")