Snakemake/Nextflow Pipelines · RevBayes Custom Models · Simulation for Validation · Publication-Quality Tree Figures
"""
Snakemake pipeline: raw loci → gene trees → species tree
"""
import glob
LOCI = glob_wildcards("loci/{locus}.fasta").locus
rule all:
input:
"results/species_tree.nwk",
"results/concordance.cf.tree",
"results/dated_tree.nex"
rule align:
input: "loci/{locus}.fasta"
output: "aligned/{locus}.aln"
threads: 4
conda: "envs/phylo.yaml"
shell: "mafft --auto --thread {threads} {input} > {output}"
rule trim:
input: "aligned/{locus}.aln"
output: "trimmed/{locus}.trim"
conda: "envs/phylo.yaml"
shell: "trimal -in {input} -out {output} -automated1"
rule gene_tree:
input: "trimmed/{locus}.trim"
output: "gene_trees/{locus}.treefile"
threads: 4
conda: "envs/phylo.yaml"
shell:
"iqtree2 -s {input} -m MFP -bb 1000 "
"-T {threads} --prefix gene_trees/{wildcards.locus}"
rule collect_gene_trees:
input: expand("gene_trees/{locus}.treefile", locus=LOCI)
output: "results/all_gene_trees.nwk"
shell: "cat {input} > {output}"
rule astral:
input: "results/all_gene_trees.nwk"
output: "results/species_tree.nwk"
conda: "envs/phylo.yaml"
shell:
"java -jar $CONDA_PREFIX/share/astral/astral.jar "
"-i {input} -o {output} -t 2"
rule concatenate:
input: expand("trimmed/{locus}.trim", locus=LOCI)
output: "results/concat.phy"
conda: "envs/phylo.yaml"
script: "scripts/concatenate.py"
rule concordance:
input:
tree="results/species_tree.nwk",
genes="results/all_gene_trees.nwk",
aln="results/concat.phy"
output: "results/concordance.cf.tree"
threads: 8
conda: "envs/phylo.yaml"
shell:
"iqtree2 -t {input.tree} --gcf {input.genes} "
"-s {input.aln} --scf 100 "
"-T {threads} --prefix results/concordance"
rule dating:
input:
tree="results/species_tree.nwk",
aln="results/concat.phy"
output: "results/dated_tree.nex"
threads: 16
conda: "envs/phylo.yaml"
shell:
"mcmctree mcmctree.ctl"
# Local run
snakemake --cores 32 --use-conda
# SLURM cluster execution
snakemake --cores 500 --use-conda \
--cluster "sbatch --cpus-per-task={threads} --mem=32G \
--time=24:00:00 --output=logs/{rule}_{wildcards}.log" \
--jobs 50
# With Snakemake profiles (recommended for HPC):
snakemake --profile slurm_profile/
# Dry run (check what would be executed)
snakemake -n
# Visualize DAG
snakemake --dag | dot -Tpng > pipeline_dag.png
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.loci = "loci/*.fasta"
params.outdir = "results"
process ALIGN {
conda 'envs/phylo.yaml'
cpus 4
input: path(locus)
output: path("${locus.baseName}.aln")
script: "mafft --auto --thread $task.cpus $locus > ${locus.baseName}.aln"
}
process GENE_TREE {
conda 'envs/phylo.yaml'
cpus 4
input: path(aln)
output: path("${aln.baseName}.treefile")
script:
"""
iqtree2 -s $aln -m MFP -bb 1000 -T $task.cpus \
--prefix ${aln.baseName}
"""
}
process ASTRAL {
conda 'envs/phylo.yaml'
publishDir params.outdir
input: path(trees)
output: path("species_tree.nwk")
script:
"""
cat $trees > all_trees.nwk
java -jar \$ASTRAL_JAR -i all_trees.nwk -o species_tree.nwk -t 2
"""
}
workflow {
loci_ch = Channel.fromPath(params.loci)
aligned = ALIGN(loci_ch)
gene_trees = GENE_TREE(aligned)
ASTRAL(gene_trees.collect())
}
RevBayes is a programmable Bayesian framework where you specify your own graphical model. Unlike BEAST2 (XML configuration), RevBayes uses the Rev language — a scripting language for probabilistic graphical models.
# RevBayes: GTR+G model with relaxed clock
# Save as: analysis.Rev
# Read data
data = readDiscreteCharacterData("alignment.nex")
n_taxa = data.ntaxa()
taxa = data.taxa()
# Tree topology + branch lengths (uniform prior on topology)
topology ~ dnUniformTopology(taxa)
for (i in 1:2*n_taxa-3) {
br_lens[i] ~ dnExponential(10.0)
moves[i] = mvScale(br_lens[i], weight=1.0)
}
phylogeny := treeAssembly(topology, br_lens)
moves[2*n_taxa-2] = mvNNI(topology, weight=10.0)
moves[2*n_taxa-1] = mvSPR(topology, weight=10.0)
# GTR substitution model
er ~ dnDirichlet(v(1,1,1,1,1,1))
pi ~ dnDirichlet(v(1,1,1,1))
moves[2*n_taxa] = mvBetaSimplex(er, weight=3.0)
moves[2*n_taxa+1] = mvBetaSimplex(pi, weight=3.0)
Q := fnGTR(er, pi)
# Gamma rate variation
alpha ~ dnExponential(1.0)
moves[2*n_taxa+2] = mvScale(alpha, weight=2.0)
site_rates := fnDiscretizeGamma(alpha, alpha, 4)
# Phylogenetic CTMC
seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q,
siteRates=site_rates, type="DNA")
seq.clamp(data)
# MCMC
mymodel = model(phylogeny)
monitors[1] = mnModel(filename="output.log", printgen=100)
monitors[2] = mnFile(phylogeny, filename="output.trees", printgen=100)
monitors[3] = mnScreen(printgen=1000)
mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.run(generations=1000000)
# Custom model example: site-heterogeneous mixture
# (RevBayes can define ANY model — this is its superpower)
# n_cats = 4
# for (i in 1:n_cats) {
# pi_mix[i] ~ dnDirichlet(v(1,1,1,1))
# Q_mix[i] := fnGTR(er, pi_mix[i])
# }
# mix_weights ~ dnDirichlet(rep(1, n_cats))
# Q_mixture := fnMixture(Q_mix, mix_weights)
| Simulator | What It Simulates | Use Case |
|---|---|---|
| Seq-Gen | Sequences along a fixed tree | Testing tree inference accuracy |
| INDELible | Sequences with indels along a tree | Alignment method benchmarking |
| SimPhy | Gene trees under multispecies coalescent | Testing species tree methods |
| msprime | Coalescent genealogies + mutations | Population genetics simulations |
| SLiM | Forward-time evolution (selection, demography) | Complex evolutionary scenarios |
# Seq-Gen: simulate sequences on a known tree # Test: can IQ-TREE recover the true tree? seq-gen -mGTR -l 1000 -n 100 -s 0.5 \ -f 0.3 0.2 0.2 0.3 \ -r 1.0 2.0 1.0 1.0 2.0 1.0 \ -a 0.5 \ < true_tree.nwk > simulated_alignments.phy # -m GTR: substitution model # -l 1000: 1000 sites # -n 100: 100 replicate datasets # -s 0.5: tree scale (substitutions per site) # -a 0.5: alpha parameter for gamma rate variation # SimPhy: simulate gene trees under the MSC + sequences simphy -rs 100 -rl f:500 -rg 1 \ -sp f:20 -sl f:10 -sb f:0.000001 \ -st f:1000000 -si f:1 \ -so f:1 -od 1 -op 1 -oc 1 \ -ol f:500 -om f:1 \ -v 3 -cs 293745 -o simphy_output # Then simulate sequences on each gene tree with Seq-Gen or INDELible # INDELible: simulate with insertions and deletions # control.txt: # [TYPE] NUCLEOTIDE 1 # [MODEL] mymodel # [submodel] GTR 0.5 2.0 0.5 0.5 2.0 # [statefreq] 0.3 0.2 0.2 0.3 # [rates] 0.5 0.2 4 # [indelmodel] POW 1.7 500 # [indelrate] 0.1 # [TREE] t1 [rooted] ((A:0.1,B:0.2):0.3,(C:0.15,D:0.25):0.05); # [PARTITIONS] part1 [t1 mymodel 1000] # [EVOLVE] part1 100 sim_output indelible
import msprime
import tskit
# Simulate a species tree with gene trees under the MSC
# Two populations that diverged 100,000 generations ago
demography = msprime.Demography()
demography.add_population(name="Pop_A", initial_size=10000)
demography.add_population(name="Pop_B", initial_size=10000)
demography.add_population(name="Ancestor", initial_size=20000)
demography.add_population_split(
time=100000, derived=["Pop_A", "Pop_B"], ancestral="Ancestor"
)
# Simulate 100 independent loci (gene trees)
ts = msprime.sim_ancestry(
samples={"Pop_A": 10, "Pop_B": 10},
demography=demography,
sequence_length=10000,
num_replicates=100,
recombination_rate=1e-8,
random_seed=42
)
# Add mutations to each tree sequence
for i, tree_seq in enumerate(ts):
mts = msprime.sim_mutations(
tree_seq, rate=1e-8, random_seed=42+i
)
# Export as VCF, FASTA, or tree sequence
with open(f"locus_{i}.vcf", "w") as f:
mts.write_vcf(f)
# Analyze tree sequences directly with tskit
# (much faster than sequence-based analysis)
ts_single = msprime.sim_ancestry(
samples={"Pop_A": 50, "Pop_B": 50},
demography=demography,
sequence_length=1e7,
recombination_rate=1e-8
)
mts = msprime.sim_mutations(ts_single, rate=1e-8)
# Compute diversity directly from tree sequence
diversity_A = mts.diversity(sample_sets=[
[s.id for s in mts.samples() if mts.node(s.id).population == 0]
])
print(f"Pi (Pop A): {diversity_A}")
library(ggtree)
library(treeio)
library(ggplot2)
library(ggtreeExtra)
# Basic tree with bootstrap support
tree <- read.tree("species_tree.nwk")
p <- ggtree(tree) +
geom_tiplab(size=3) +
geom_nodelab(aes(label=label), size=2, hjust=-0.1) +
theme_tree2()
# Dated tree with 95% HPD node bars
beast_tree <- read.beast("dated_tree.nex")
p_dated <- ggtree(beast_tree, mrsd="2024-01-01") +
geom_range("height_0.95_HPD", color="steelblue", alpha=0.4, size=2) +
geom_tiplab(size=2.5) +
scale_x_continuous(labels=abs) +
xlab("Time (Ma)") +
theme_tree2()
# Annotate with trait data (heatmap next to tree)
trait_data <- read.csv("traits.csv", row.names=1)
p_heat <- gheatmap(p, trait_data, offset=0.05, width=0.3,
colnames_angle=45, colnames_offset_y=0.25) +
scale_fill_viridis_d() +
theme(legend.position="right")
# Multi-panel phylogenomic figure
# Panel 1: Species tree with support
# Panel 2: Concordance factors
# Panel 3: Divergence times
library(patchwork)
fig <- p_support + p_concordance + p_dated +
plot_layout(nrow=1, widths=c(1, 1, 1.5)) +
plot_annotation(tag_levels='A')
ggsave("phylogenomic_figure.pdf", fig, width=18, height=8)
# Circular tree with rate heatmap
p_circ <- ggtree(beast_tree, layout="circular") +
geom_tiplab2(size=2) +
geom_treescale()
# Co-phylogeny (tanglegram)
library(phytools)
cophylo_obj <- cophylo(host_tree, parasite_tree, assoc_matrix)
plot(cophylo_obj, link.type="curved", link.col="gray50",
fsize=0.7, pts=FALSE)
# Export for iTOL (online visualization)
# Upload tree to https://itol.embl.de/
# Add datasets as annotation files:
# DATASET_COLORSTRIP, DATASET_HEATMAP, DATASET_BINARY, etc.
import toytree
import toyplot
import numpy as np
# Load tree
tree = toytree.tree("species_tree.nwk")
# Basic plot with support values
canvas, axes, mark = tree.draw(
width=400, height=600,
tip_labels_align=True,
node_labels="support",
node_sizes=8,
node_colors=toytree.colors[2]
)
# Color branches by a trait
trait_colors = {
"SpeciesA": "red", "SpeciesB": "red",
"SpeciesC": "blue", "SpeciesD": "blue",
}
edge_colors = []
for node in tree.treenode.traverse():
if node.is_leaf():
edge_colors.append(trait_colors.get(node.name, "black"))
else:
edge_colors.append("black")
canvas, axes, mark = tree.draw(
edge_colors=edge_colors,
edge_widths=2,
tip_labels_align=True
)
# Save as SVG/PDF
import toyplot.svg
toyplot.svg.render(canvas, "tree_figure.svg")
| Tool | Language | Best For | Output |
|---|---|---|---|
| ggtree | R | Publication figures, complex annotations | PDF/SVG/PNG |
| iTOL | Web | Interactive exploration, large trees | SVG/PDF/PNG |
| FigTree | Java GUI | Quick viewing, BEAST2 output | SVG/PDF/PNG |
| toytree | Python | Pythonic workflow, Jupyter notebooks | SVG/HTML |
| Dendroscope | Java GUI | Very large trees (100k+ tips), networks | SVG/PDF |
| phytools | R | Trait mapping, ancestral states on trees | PDF/PNG |
Figure production workflow
Use ggtree or toytree for the base tree layout → annotate programmatically → export as SVG → final touch-up in Inkscape/Illustrator if needed. Never manually draw phylogenies — always generate them from the actual tree file for reproducibility.
Common HPC issues: (1) Check cluster logs (not just Snakemake output), (2) Ensure conda environments are created before submission (snakemake --use-conda --conda-create-envs-only), (3) Set --latency-wait 60 for NFS file systems with slow metadata, (4) Use --rerun-incomplete to retry failed jobs.
Rev language quirks: (1) Use := for deterministic variables, ~ for stochastic, = for constants, (2) Indexing starts at 1 (not 0), (3) Every move needs a weight, (4) Monitor all parameters you care about. Consult RevBayes tutorials at revbayes.github.io.
The most common ggtree error. Ensure: (1) Row names in your data frame exactly match tip labels in the tree, (2) No extra whitespace or special characters, (3) Use tree$tip.label to check. Use drop.tip() to remove tips not in your data or add missing rows with NA.
All times in msprime are in generations (not years). If your split time is 100,000 years and generation time is 25 years, use time=4000 generations. Also ensure derived populations exist before their split time and initial sizes are positive.
"Simulate, Analyze, Compare — SAC"
The validation cycle: Simulate data under known conditions → Analyze with your pipeline → Compare inferred results to known truth. If your pipeline can't recover known truths from simulations, it won't recover truths from real data either.
"ggtree for Glamour, iTOL for Interactive"
Use ggtree (R) when you need fine-grained control for publication figures. Use iTOL (web) when you need to quickly explore and share annotated trees. Both can produce publication-quality output, but ggtree gives you reproducible R scripts.