Phylogenetic Inference — Deep Dive Module 2

Gene/Species Tree Discordance · Species Tree Methods · Phylogenetic Networks · Topology Testing · Long Branch Attraction

~150 min Advanced Bash / R / Julia
Byte

1.4 — Gene Tree / Species Tree Discordance

Sources of Discordance

SourceMechanismSignatureDetection Tool
ILSDeep coalescence in ancestral populationsRandom discordance, no directional biasASTRAL, concordance factors
Hybridization/IntrogressionGene flow between lineagesDirectional excess of one discordant topologyD-statistics, PhyloNet, HyDe
Gene duplication/lossParalogs mistaken for orthologsUnexpected deep divergences within a gene treeASTRAL-Pro, Notung, ALE
Horizontal gene transferLateral acquisition (prokaryotes)Gene in unexpected cladeRanger-DTL, ALE
Gene tree estimation errorShort loci, poor alignmentLow bootstrap, short branchesCollapse low-support branches

Concordance Factors in IQ-TREE

Concordance factors measure how much support the data actually provides for each branch, beyond what bootstrap says. gCF = fraction of gene trees supporting a branch. sCF = fraction of informative sites supporting a branch.

Bash
# Step 1: Build individual gene trees
# (One FASTA per locus in a directory)
iqtree2 -S loci_dir/ -m MFP -T AUTO --prefix loci_trees

# Step 2: Compute concordance factors on the species tree
iqtree2 -t species_tree.nwk --gcf loci_trees.treefile \
  -s concat_alignment.phy --scf 100 \
  -T AUTO --prefix concord

# Output: concord.cf.tree
# Each branch annotated with: gCF/sCF
# Example: (A,(B,C)95/78/45);
#   95 = bootstrap, 78 = gCF, 45 = sCF

# Interpretation:
# gCF = 78% means 78% of gene trees contain this branch
# sCF = 45% means 45% of decisive sites support this branch
# For a 3-taxon split, sCF of 33% = random (no signal)
# sCF >> 33% = genuine signal supporting the branch
Download:
Thumb Rule — Reading Concordance Factors:
gCF ≥ 60% = strongly supported by gene trees
gCF 33–60% = conflicted (ILS or hybridization zone)
gCF ~ 33% = no resolution (three topologies equally frequent = ILS at short internodes)
sCF ~ 33% = genuinely uninformative (too few sites)

DensiTree & DiscoVista for Discordance Visualization

Bash
# DensiTree: overlay all gene trees semi-transparently
# Install: included with BEAST2
densitree -b 0 gene_trees.nwk
# Areas of agreement = solid dark lines
# Areas of discordance = fuzzy, spread-out regions

# DiscoVista: quantify and visualize discordance per clade
# https://github.com/esayyari/DiscoVista
python discoVista.py -a annotation_file.txt \
  -t gene_trees_dir/ -m 75 \
  -o discovista_output/
# -m 75: collapse branches with <75% bootstrap
# Produces bar charts showing concordant vs discordant vs missing
Mnemonic

"ILS is Random, Introgression is Directional" — IRID

Under ILS alone, the two discordant topologies are equally frequent (each ~⅓ at short internodes). If one discordant topology is significantly more common than the other, suspect introgression/hybridization. Test with D-statistics (ABBA-BABA).

1.5 — Species Tree Methods

Summary vs Co-Estimation Methods

ApproachToolsInputSpeedStrengths
Summary (2-step)ASTRAL-III, MP-ESTPre-estimated gene treesVery fastScalable to 1000s of loci, statistically consistent under MSC
Co-estimationStarBEAST2, BEAST2+MSCSequence alignmentsVery slowJointly estimates gene trees + species tree, uses full data
SNP-basedSVDquartets, SNAPPSNP matrixModerateNo gene tree estimation step, works with unlinked SNPs
ConcatenationIQ-TREE, RAxML-NGConcatenated alignmentModerateHigh power, but not statistically consistent under MSC

ASTRAL-III & ASTRAL-Pro

Bash
# ASTRAL-III — standard species tree from single-copy orthologs
java -jar astral.5.7.8.jar \
  -i gene_trees.nwk \
  -o species_tree.nwk \
  -t 2 2> astral.log
# -t 2: annotate with local posterior probability AND quartet support

# Collapsing low-support branches in gene trees (recommended!)
# Reduces gene tree estimation error propagation
nw_ed gene_trees.nwk 'i & b<10' o > collapsed_trees.nwk
java -jar astral.5.7.8.jar -i collapsed_trees.nwk -o species_tree.nwk

# ASTRAL-Pro — handles multi-copy gene families (paralogs)
java -jar astral-pro.jar \
  -i gene_family_trees.nwk \
  -o species_tree.nwk \
  -u 2  # -u for multi-individual/multi-copy

# Weighted ASTRAL (wASTRAL) — weights quartets by branch lengths
java -jar wastral.jar -i gene_trees.nwk -o species_tree.nwk
Download:

StarBEAST2 — Co-Estimation Under the MSC

Workflow
# StarBEAST2 workflow (BEAST2 package):
# 1. Install: in BEAST2 Package Manager → StarBEAST2
# 2. Open BEAUti, load multiple gene alignments
# 3. Set up:
#    - Partitions tab: link/unlink trees per gene
#    - Taxon sets: map individuals → species
#    - Clock: relaxed uncorrelated lognormal (per gene)
#    - Tree prior: multispecies coalescent (species tree birth-death)
#    - Species tree prior: Yule or Birth-Death
# 4. MCMC: ngen=500000000 (yes, 500M), logging every 50000
# 5. Run on HPC:
beast -threads 16 -beagle starbeast2.xml
# 6. Check convergence in Tracer (ESS > 200)
# 7. Summarize: TreeAnnotator on species tree log

SVDquartets — SNP-Based Species Tree

PAUP*
# SVDquartets in PAUP*
execute alignment.nex;

[Define species/populations]
taxpartition species =
  SpeciesA: TaxonA1 TaxonA2 TaxonA3,
  SpeciesB: TaxonB1 TaxonB2,
  SpeciesC: TaxonC1 TaxonC2 TaxonC3;

svdquartets evalQuartets=all taxpartition=species
  nthreads=16 bootstrap=standard nreps=100;
savetrees file=svdq_tree.tre format=newick;
Download:
When to Concatenate vs Use Coalescent Methods
  • Concatenation works when: internodes are long (low ILS), species are closely related, loci have similar rates
  • Coalescent methods needed when: rapid radiation (short internodes), known hybridization, gene trees show substantial conflict
  • Both agree: great, high confidence in the result
  • They disagree: investigate further — compute concordance factors, test for introgression
Thumb Rule — Gene Trees for ASTRAL: Collapse branches with bootstrap <10% in gene trees before running ASTRAL. This prevents erroneous gene tree resolutions from biasing the species tree. Use nw_ed from Newick Utilities or IQ-TREE's --con-tree.

1.6 — Phylogenetic Networks

When reticulate evolution (hybridization, introgression, HGT) is present, a tree is insufficient. Networks represent both vertical and horizontal inheritance.

HyDe — Hybridization Detection

Bash / Python
# HyDe: fast test for hybridization using site pattern frequencies
# Tests if a taxon (Hybrid) is an admixture of P1 and P2
pip install phyde

# Command line
run_hyde.py -i alignment.phy -m map_file.txt -o outgroup \
  -n num_individuals -t num_taxa -s num_sites

# Python API for all triplets
import phyde as gp
dat = gp.HydeData("alignment.phy", "map_file.txt", "outgroup",
                    num_ind, num_taxa, num_sites)
res = dat.bootstrap_triple("P1", "Hybrid", "P2", reps=1000)
# gamma (admixture proportion): 0 = pure P1, 1 = pure P2
# Significant p-value + gamma ∈ (0,1) = hybridization detected
Download:

SNaQ (PhyloNetworks in Julia)

Julia
using PhyloNetworks

# Read gene trees (concordance factor table from IQ-TREE or BUCKy)
cf_table = readTableCF("concordance_factors.csv")

# Start from a tree (e.g., ASTRAL species tree)
start_tree = readTopology("species_tree.nwk")

# Estimate network with 0 hybridizations (= tree)
net0 = snaq!(start_tree, cf_table, hmax=0, runs=10)

# Estimate with 1 hybridization
net1 = snaq!(net0, cf_table, hmax=1, runs=10)

# Estimate with 2 hybridizations
net2 = snaq!(net1, cf_table, hmax=2, runs=10)

# Compare models using log-pseudolikelihood
# Use the "elbow" method: when adding another h doesn't improve much
println("h=0: ", net0.loglik)
println("h=1: ", net1.loglik)
println("h=2: ", net2.loglik)

# Plot network
using PhyloPlots
plot(net1, showGamma=true)  # gamma = inheritance proportion
Download:

PhyloNet — Network Inference

NEXUS
# PhyloNet: InferNetwork_MPL (maximum pseudo-likelihood)
#NEXUS
BEGIN TREES;
Tree gt1 = (A,(B,(C,D)));
Tree gt2 = (A,(C,(B,D)));
Tree gt3 = (A,(B,(C,D)));
[... more gene trees ...]
END;

BEGIN PHYLONET;
InferNetwork_MPL (gt1,gt2,gt3) 1 -x 10 -pl 16;
[1 = max number of reticulations]
[-x 10 = 10 independent runs]
[-pl 16 = 16 threads]
END;

SplitsTree — Exploratory Network Visualization

Bash
# SplitsTree: quick visual check for reticulation
# Input: alignment or distance matrix
# GUI tool — load alignment → Distances → NeighborNet
# Non-tree-like structure = possible hybridization/recombination

# Command line (SplitsTree5):
SplitsTree5 -i alignment.fasta -o network.nex
Mnemonic

"Networks for Non-tree-like evolution"

If gene trees are discordant and the discordance is directional (not random ILS), you need a network, not a tree. Start with SplitsTree for visualization, HyDe for quick tests, SNaQ for formal network estimation.

1.7 — Topology Testing & Hypothesis Comparison

Available Tests

TestNull HypothesisOne-sided?Recommended?
AU testTree is as good as the ML treeNo (multi-scale bootstrap)Yes — gold standard
SH testTree is as good as the ML treeNoConservative, may fail to reject bad trees
KH testTwo trees are equally goodOnly for pre-specified treesOnly for a priori hypotheses
SOWH testConstrained tree is trueParametric bootstrapMost powerful, but slow
Bayes factorsModel comparisonN/ABest in Bayesian framework
Bash — IQ-TREE Topology Tests
# Step 1: Collect candidate trees (ML tree + alternatives)
cat ml_tree.nwk alternative1.nwk alternative2.nwk > candidate_trees.nwk

# Step 2: Run all topology tests
iqtree2 -s alignment.phy -m GTR+G \
  -z candidate_trees.nwk \
  -zb 10000 -zw -au \
  -T AUTO --prefix topotest

# -zb 10000: 10000 RELL bootstrap replicates
# -zw: weighted SH and KH tests
# -au: AU test

# Output: topotest.iqtree contains a table:
# Tree  logL     deltaL   bp-RELL  p-KH   p-SH   p-WKH  p-WSH  c-ELW  p-AU
# 1     -45678   0        0.89     0.93   0.98   0.92   0.97   0.88   0.95
# 2     -45712   34       0.08     0.07   0.12   0.08   0.13   0.09   0.04*
# 3     -45745   67       0.03     0.02*  0.04*  0.02*  0.04*  0.03   0.01*
# * = rejected at p < 0.05
# Trees not rejected by AU test cannot be statistically distinguished
Download:

Four-Cluster Likelihood Mapping

Bash
# IQ-TREE likelihood mapping (test phylogenetic signal)
iqtree2 -s alignment.phy -m MFP -lmap 10000 -T AUTO

# Output: .lmap.svg and .iqtree
# Triangle plot: corners = fully resolved, center = unresolved
# >90% in corners = strong signal
# >30% in center = insufficient signal for this dataset

# Four-cluster likelihood mapping (custom taxon groups):
# Create a file defining 4 clusters:
echo "ClusterA: TaxA1 TaxA2 TaxA3
ClusterB: TaxB1 TaxB2
ClusterC: TaxC1 TaxC2
ClusterD: TaxD1 TaxD2 TaxD3" > clusters.txt

iqtree2 -s alignment.phy -m MFP -lmap 10000 \
  -lmclust clusters.txt -T AUTO
Thumb Rule — Topology Testing: The AU test is the standard. If a tree is NOT rejected by the AU test, it cannot be statistically distinguished from the ML tree — even if its likelihood is lower. Report AU p-values for all alternative hypotheses in your paper.

1.8 — Long Branch Attraction & Systematic Error

Identifying LBA and Compositional Bias

Long Branch Attraction (LBA) = fast-evolving taxa artificially group together regardless of their true relationship. Compositional bias = taxa with similar base composition cluster together. Both are systematic errors that more data makes worse, not better.

Diagnostic Signs of Systematic Error
  • Two long-branched taxa group together, but removing one changes the topology
  • Adding more loci increases bootstrap but doesn't change the (suspect) topology
  • Amino acid composition differs dramatically between clades (chi-squared test)
  • Posterior predictive simulations reject the model for composition
  • Simpler models (GTR+G) give a different topology than complex models (CAT-GTR)

Amino Acid Recoding

Recoding amino acids into fewer categories reduces compositional signal and saturation at the cost of losing some phylogenetic information.

Bash / Python
# Dayhoff6 recoding: 20 amino acids → 6 categories
# Group 1 (ASTGP): Small
# Group 2 (DNEQ): Acidic/amide
# Group 3 (RKH): Basic
# Group 4 (MVIL): Hydrophobic
# Group 5 (FYW): Aromatic
# Group 6 (C): Cysteine

# Python script for Dayhoff6 recoding:
python3 -c "
import sys
dayhoff6 = {
    'A':'0','S':'0','T':'0','G':'0','P':'0',
    'D':'1','N':'1','E':'1','Q':'1',
    'R':'2','K':'2','H':'2',
    'M':'3','V':'3','I':'3','L':'3',
    'F':'4','Y':'4','W':'4',
    'C':'5',
    '-':'-','X':'?','?':'?'
}
for line in open(sys.argv[1]):
    if line.startswith('>'):
        print(line.strip())
    else:
        print(''.join(dayhoff6.get(c.upper(),'?') for c in line.strip()))
" alignment.fasta > alignment_dayhoff6.fasta

# SR4 recoding: 20 → 4 categories (even more aggressive)
# A/G/N/P/S/T → 0
# C/H/W/Y → 1
# D/E/K/Q/R → 2
# F/I/L/M/V → 3

# Run IQ-TREE on recoded data with appropriate model
iqtree2 -s alignment_dayhoff6.fasta -m GTR+G \
  -st MORPH -T AUTO -bb 1000
Download:

Removing Fast-Evolving Sites or Taxa

Bash
# IQ-TREE: site rates estimation
iqtree2 -s alignment.phy -m LG+G -T AUTO --rate
# Output: .rate file with per-site rates

# Remove fastest 10%, 20%, 30% of sites progressively
# and re-run the tree each time
# If the topology changes = fast sites were misleading

# Slow-fast analysis in R:
# Read rates, sort, remove top quantile, subset alignment

# BaCoCa: batch composition checker
# Tests each taxon's amino acid composition against average
# https://github.com/simonwhelan/BaCoCa
python BaCoCa.py -i alignment.fasta
# Chi-squared p-value < 0.05 for a taxon = compositional outlier

Site-Heterogeneous Models as the Solution

Byte key
The best defense against LBA and compositional bias:
  1. Use CAT-GTR (PhyloBayes) — accounts for site-specific amino acid profiles
  2. Use C60/PMSF (IQ-TREE) — fast approximation of site heterogeneity
  3. Recode amino acids (Dayhoff6/SR4) to reduce compositional signal
  4. Remove saturated/fast sites and check if topology is stable
  5. Remove suspect long-branch taxa one at a time and check topology stability
  6. Run both approaches and report whether they agree
Mnemonic

"More data + Wrong model = More Wrong"

Systematic error is the opposite of stochastic error. Adding more genes with a misspecified model increases bootstrap for the WRONG tree. The fix is a better model, not more data.

Thumb Rule

LBA sensitivity analysis

If removing one long-branch taxon changes where another long-branch taxon falls in the tree, LBA is operating. Report both topologies and discuss. Use site-heterogeneous models and amino acid recoding as cross-checks.

AI Prompt Guide

Discordance & Network Analysis
I have [NUMBER] gene trees from [NUMBER] taxa spanning [CLADE]. I suspect [ILS / hybridization / both] based on preliminary results. My concatenation tree disagrees with ASTRAL for [SPECIFIC NODE]. Please write a pipeline that: 1. Computes gene and site concordance factors (gCF/sCF) in IQ-TREE 2. Runs ABBA-BABA / D-statistics to test for introgression at that node 3. Estimates a phylogenetic network with SNaQ (PhyloNetworks, Julia) testing h=0,1,2,3 4. Runs HyDe for all possible hybridization triplets 5. Tests alternative topologies with the AU test in IQ-TREE 6. Performs amino acid recoding (Dayhoff6) and re-runs the tree to check for LBA 7. Visualizes discordance with DensiTree 8. Provides interpretation guidelines for each result

Troubleshooting

ASTRAL: "Input gene trees have polytomies"

ASTRAL handles polytomies natively (treats them as uncertainty). But if you want to resolve them first: collapse branches with low support (nw_ed trees.nwk 'i & b<10' o) to create genuine polytomies rather than poorly-resolved bifurcations.

SNaQ: Pseudo-likelihood doesn't improve with more reticulations

The data may genuinely be tree-like (no hybridization). Or the signal-to-noise ratio is too low. Try: (1) use more gene trees, (2) ensure concordance factor table is computed correctly, (3) check that your starting tree is reasonable.

AU test: All alternative trees rejected

This means the ML tree is significantly better than all tested alternatives. This is a strong result — but verify your alternative trees are correctly specified (check rooting, taxon names, Newick format).

Dayhoff6 recoding gives different topology

This is informative. It suggests that the original topology was influenced by compositional bias or saturation. The recoded result is typically more reliable for deep divergences, though it has less power (fewer character states). Report both and discuss which biological scenario each supports.

Mnemonics & Thumb Rules

Mnemonic

"Concatenate for Confidence, Coalescent for Correctness"

Concatenation gives high bootstrap (because of sheer data volume) but can be positively misleading under ILS. Coalescent methods (ASTRAL) are statistically consistent under the MSC. Always run both and compare.

Mnemonic

"AU for All Uses" — the best topology test

The Approximately Unbiased test is the recommended topology test. SH is too conservative (rarely rejects). KH requires pre-specified hypotheses. The AU test correctly handles selection bias and works for any set of candidate trees.

Thumb Rule — Network Complexity: Start with h=0 (tree), then h=1, h=2. If log-pseudolikelihood doesn't improve by >5 units when adding another reticulation, stop. Most real datasets have 0–2 reticulations. Networks with h>3 are rarely well-supported.
Thumb Rule — Species Trees: Need ≥50 gene trees for a reliable ASTRAL analysis. 200+ is ideal. Each gene tree should have ≥200 informative sites. Very short loci (< 300 bp) contribute mostly noise — either concatenate them or exclude them.