Gene/Species Tree Discordance · Species Tree Methods · Phylogenetic Networks · Topology Testing · Long Branch Attraction
| Source | Mechanism | Signature | Detection Tool |
|---|---|---|---|
| ILS | Deep coalescence in ancestral populations | Random discordance, no directional bias | ASTRAL, concordance factors |
| Hybridization/Introgression | Gene flow between lineages | Directional excess of one discordant topology | D-statistics, PhyloNet, HyDe |
| Gene duplication/loss | Paralogs mistaken for orthologs | Unexpected deep divergences within a gene tree | ASTRAL-Pro, Notung, ALE |
| Horizontal gene transfer | Lateral acquisition (prokaryotes) | Gene in unexpected clade | Ranger-DTL, ALE |
| Gene tree estimation error | Short loci, poor alignment | Low bootstrap, short branches | Collapse low-support branches |
Concordance factors measure how much support the data actually provides for each branch, beyond what bootstrap says. gCF = fraction of gene trees supporting a branch. sCF = fraction of informative sites supporting a branch.
# Step 1: Build individual gene trees # (One FASTA per locus in a directory) iqtree2 -S loci_dir/ -m MFP -T AUTO --prefix loci_trees # Step 2: Compute concordance factors on the species tree iqtree2 -t species_tree.nwk --gcf loci_trees.treefile \ -s concat_alignment.phy --scf 100 \ -T AUTO --prefix concord # Output: concord.cf.tree # Each branch annotated with: gCF/sCF # Example: (A,(B,C)95/78/45); # 95 = bootstrap, 78 = gCF, 45 = sCF # Interpretation: # gCF = 78% means 78% of gene trees contain this branch # sCF = 45% means 45% of decisive sites support this branch # For a 3-taxon split, sCF of 33% = random (no signal) # sCF >> 33% = genuine signal supporting the branch
# DensiTree: overlay all gene trees semi-transparently # Install: included with BEAST2 densitree -b 0 gene_trees.nwk # Areas of agreement = solid dark lines # Areas of discordance = fuzzy, spread-out regions # DiscoVista: quantify and visualize discordance per clade # https://github.com/esayyari/DiscoVista python discoVista.py -a annotation_file.txt \ -t gene_trees_dir/ -m 75 \ -o discovista_output/ # -m 75: collapse branches with <75% bootstrap # Produces bar charts showing concordant vs discordant vs missing
"ILS is Random, Introgression is Directional" — IRID
Under ILS alone, the two discordant topologies are equally frequent (each ~⅓ at short internodes). If one discordant topology is significantly more common than the other, suspect introgression/hybridization. Test with D-statistics (ABBA-BABA).
| Approach | Tools | Input | Speed | Strengths |
|---|---|---|---|---|
| Summary (2-step) | ASTRAL-III, MP-EST | Pre-estimated gene trees | Very fast | Scalable to 1000s of loci, statistically consistent under MSC |
| Co-estimation | StarBEAST2, BEAST2+MSC | Sequence alignments | Very slow | Jointly estimates gene trees + species tree, uses full data |
| SNP-based | SVDquartets, SNAPP | SNP matrix | Moderate | No gene tree estimation step, works with unlinked SNPs |
| Concatenation | IQ-TREE, RAxML-NG | Concatenated alignment | Moderate | High power, but not statistically consistent under MSC |
# ASTRAL-III — standard species tree from single-copy orthologs java -jar astral.5.7.8.jar \ -i gene_trees.nwk \ -o species_tree.nwk \ -t 2 2> astral.log # -t 2: annotate with local posterior probability AND quartet support # Collapsing low-support branches in gene trees (recommended!) # Reduces gene tree estimation error propagation nw_ed gene_trees.nwk 'i & b<10' o > collapsed_trees.nwk java -jar astral.5.7.8.jar -i collapsed_trees.nwk -o species_tree.nwk # ASTRAL-Pro — handles multi-copy gene families (paralogs) java -jar astral-pro.jar \ -i gene_family_trees.nwk \ -o species_tree.nwk \ -u 2 # -u for multi-individual/multi-copy # Weighted ASTRAL (wASTRAL) — weights quartets by branch lengths java -jar wastral.jar -i gene_trees.nwk -o species_tree.nwk
# StarBEAST2 workflow (BEAST2 package): # 1. Install: in BEAST2 Package Manager → StarBEAST2 # 2. Open BEAUti, load multiple gene alignments # 3. Set up: # - Partitions tab: link/unlink trees per gene # - Taxon sets: map individuals → species # - Clock: relaxed uncorrelated lognormal (per gene) # - Tree prior: multispecies coalescent (species tree birth-death) # - Species tree prior: Yule or Birth-Death # 4. MCMC: ngen=500000000 (yes, 500M), logging every 50000 # 5. Run on HPC: beast -threads 16 -beagle starbeast2.xml # 6. Check convergence in Tracer (ESS > 200) # 7. Summarize: TreeAnnotator on species tree log
# SVDquartets in PAUP* execute alignment.nex; [Define species/populations] taxpartition species = SpeciesA: TaxonA1 TaxonA2 TaxonA3, SpeciesB: TaxonB1 TaxonB2, SpeciesC: TaxonC1 TaxonC2 TaxonC3; svdquartets evalQuartets=all taxpartition=species nthreads=16 bootstrap=standard nreps=100; savetrees file=svdq_tree.tre format=newick;
nw_ed from Newick Utilities or IQ-TREE's --con-tree.
When reticulate evolution (hybridization, introgression, HGT) is present, a tree is insufficient. Networks represent both vertical and horizontal inheritance.
# HyDe: fast test for hybridization using site pattern frequencies
# Tests if a taxon (Hybrid) is an admixture of P1 and P2
pip install phyde
# Command line
run_hyde.py -i alignment.phy -m map_file.txt -o outgroup \
-n num_individuals -t num_taxa -s num_sites
# Python API for all triplets
import phyde as gp
dat = gp.HydeData("alignment.phy", "map_file.txt", "outgroup",
num_ind, num_taxa, num_sites)
res = dat.bootstrap_triple("P1", "Hybrid", "P2", reps=1000)
# gamma (admixture proportion): 0 = pure P1, 1 = pure P2
# Significant p-value + gamma ∈ (0,1) = hybridization detected
using PhyloNetworks
# Read gene trees (concordance factor table from IQ-TREE or BUCKy)
cf_table = readTableCF("concordance_factors.csv")
# Start from a tree (e.g., ASTRAL species tree)
start_tree = readTopology("species_tree.nwk")
# Estimate network with 0 hybridizations (= tree)
net0 = snaq!(start_tree, cf_table, hmax=0, runs=10)
# Estimate with 1 hybridization
net1 = snaq!(net0, cf_table, hmax=1, runs=10)
# Estimate with 2 hybridizations
net2 = snaq!(net1, cf_table, hmax=2, runs=10)
# Compare models using log-pseudolikelihood
# Use the "elbow" method: when adding another h doesn't improve much
println("h=0: ", net0.loglik)
println("h=1: ", net1.loglik)
println("h=2: ", net2.loglik)
# Plot network
using PhyloPlots
plot(net1, showGamma=true) # gamma = inheritance proportion
# PhyloNet: InferNetwork_MPL (maximum pseudo-likelihood) #NEXUS BEGIN TREES; Tree gt1 = (A,(B,(C,D))); Tree gt2 = (A,(C,(B,D))); Tree gt3 = (A,(B,(C,D))); [... more gene trees ...] END; BEGIN PHYLONET; InferNetwork_MPL (gt1,gt2,gt3) 1 -x 10 -pl 16; [1 = max number of reticulations] [-x 10 = 10 independent runs] [-pl 16 = 16 threads] END;
# SplitsTree: quick visual check for reticulation # Input: alignment or distance matrix # GUI tool — load alignment → Distances → NeighborNet # Non-tree-like structure = possible hybridization/recombination # Command line (SplitsTree5): SplitsTree5 -i alignment.fasta -o network.nex
"Networks for Non-tree-like evolution"
If gene trees are discordant and the discordance is directional (not random ILS), you need a network, not a tree. Start with SplitsTree for visualization, HyDe for quick tests, SNaQ for formal network estimation.
| Test | Null Hypothesis | One-sided? | Recommended? |
|---|---|---|---|
| AU test | Tree is as good as the ML tree | No (multi-scale bootstrap) | Yes — gold standard |
| SH test | Tree is as good as the ML tree | No | Conservative, may fail to reject bad trees |
| KH test | Two trees are equally good | Only for pre-specified trees | Only for a priori hypotheses |
| SOWH test | Constrained tree is true | Parametric bootstrap | Most powerful, but slow |
| Bayes factors | Model comparison | N/A | Best in Bayesian framework |
# Step 1: Collect candidate trees (ML tree + alternatives) cat ml_tree.nwk alternative1.nwk alternative2.nwk > candidate_trees.nwk # Step 2: Run all topology tests iqtree2 -s alignment.phy -m GTR+G \ -z candidate_trees.nwk \ -zb 10000 -zw -au \ -T AUTO --prefix topotest # -zb 10000: 10000 RELL bootstrap replicates # -zw: weighted SH and KH tests # -au: AU test # Output: topotest.iqtree contains a table: # Tree logL deltaL bp-RELL p-KH p-SH p-WKH p-WSH c-ELW p-AU # 1 -45678 0 0.89 0.93 0.98 0.92 0.97 0.88 0.95 # 2 -45712 34 0.08 0.07 0.12 0.08 0.13 0.09 0.04* # 3 -45745 67 0.03 0.02* 0.04* 0.02* 0.04* 0.03 0.01* # * = rejected at p < 0.05 # Trees not rejected by AU test cannot be statistically distinguished
# IQ-TREE likelihood mapping (test phylogenetic signal) iqtree2 -s alignment.phy -m MFP -lmap 10000 -T AUTO # Output: .lmap.svg and .iqtree # Triangle plot: corners = fully resolved, center = unresolved # >90% in corners = strong signal # >30% in center = insufficient signal for this dataset # Four-cluster likelihood mapping (custom taxon groups): # Create a file defining 4 clusters: echo "ClusterA: TaxA1 TaxA2 TaxA3 ClusterB: TaxB1 TaxB2 ClusterC: TaxC1 TaxC2 ClusterD: TaxD1 TaxD2 TaxD3" > clusters.txt iqtree2 -s alignment.phy -m MFP -lmap 10000 \ -lmclust clusters.txt -T AUTO
Long Branch Attraction (LBA) = fast-evolving taxa artificially group together regardless of their true relationship. Compositional bias = taxa with similar base composition cluster together. Both are systematic errors that more data makes worse, not better.
Recoding amino acids into fewer categories reduces compositional signal and saturation at the cost of losing some phylogenetic information.
# Dayhoff6 recoding: 20 amino acids → 6 categories
# Group 1 (ASTGP): Small
# Group 2 (DNEQ): Acidic/amide
# Group 3 (RKH): Basic
# Group 4 (MVIL): Hydrophobic
# Group 5 (FYW): Aromatic
# Group 6 (C): Cysteine
# Python script for Dayhoff6 recoding:
python3 -c "
import sys
dayhoff6 = {
'A':'0','S':'0','T':'0','G':'0','P':'0',
'D':'1','N':'1','E':'1','Q':'1',
'R':'2','K':'2','H':'2',
'M':'3','V':'3','I':'3','L':'3',
'F':'4','Y':'4','W':'4',
'C':'5',
'-':'-','X':'?','?':'?'
}
for line in open(sys.argv[1]):
if line.startswith('>'):
print(line.strip())
else:
print(''.join(dayhoff6.get(c.upper(),'?') for c in line.strip()))
" alignment.fasta > alignment_dayhoff6.fasta
# SR4 recoding: 20 → 4 categories (even more aggressive)
# A/G/N/P/S/T → 0
# C/H/W/Y → 1
# D/E/K/Q/R → 2
# F/I/L/M/V → 3
# Run IQ-TREE on recoded data with appropriate model
iqtree2 -s alignment_dayhoff6.fasta -m GTR+G \
-st MORPH -T AUTO -bb 1000
# IQ-TREE: site rates estimation iqtree2 -s alignment.phy -m LG+G -T AUTO --rate # Output: .rate file with per-site rates # Remove fastest 10%, 20%, 30% of sites progressively # and re-run the tree each time # If the topology changes = fast sites were misleading # Slow-fast analysis in R: # Read rates, sort, remove top quantile, subset alignment # BaCoCa: batch composition checker # Tests each taxon's amino acid composition against average # https://github.com/simonwhelan/BaCoCa python BaCoCa.py -i alignment.fasta # Chi-squared p-value < 0.05 for a taxon = compositional outlier
"More data + Wrong model = More Wrong"
Systematic error is the opposite of stochastic error. Adding more genes with a misspecified model increases bootstrap for the WRONG tree. The fix is a better model, not more data.
LBA sensitivity analysis
If removing one long-branch taxon changes where another long-branch taxon falls in the tree, LBA is operating. Report both topologies and discuss. Use site-heterogeneous models and amino acid recoding as cross-checks.
ASTRAL handles polytomies natively (treats them as uncertainty). But if you want to resolve them first: collapse branches with low support (nw_ed trees.nwk 'i & b<10' o) to create genuine polytomies rather than poorly-resolved bifurcations.
The data may genuinely be tree-like (no hybridization). Or the signal-to-noise ratio is too low. Try: (1) use more gene trees, (2) ensure concordance factor table is computed correctly, (3) check that your starting tree is reasonable.
This means the ML tree is significantly better than all tested alternatives. This is a strong result — but verify your alternative trees are correctly specified (check rooting, taxon names, Newick format).
This is informative. It suggests that the original topology was influenced by compositional bias or saturation. The recoded result is typically more reliable for deep divergences, though it has less power (fewer character states). Report both and discuss which biological scenario each supports.
"Concatenate for Confidence, Coalescent for Correctness"
Concatenation gives high bootstrap (because of sheer data volume) but can be positively misleading under ILS. Coalescent methods (ASTRAL) are statistically consistent under the MSC. Always run both and compare.
"AU for All Uses" — the best topology test
The Approximately Unbiased test is the recommended topology test. SH is too conservative (rarely rejects). KH requires pre-specified hypotheses. The AU test correctly handles selection bias and works for any set of candidate trees.