Module 8 — Workflows, Simulation & Visualization

Snakemake/Nextflow Pipelines · RevBayes Custom Models · Simulation for Validation · Publication-Quality Tree Figures

~120 min Advanced Snakemake / Rev / R / Python
Byte

10.1 — Pipeline Construction (Snakemake & Nextflow)

Snakemake — Phylogenomic Pipeline
"""
Snakemake pipeline: raw loci → gene trees → species tree
"""
import glob

LOCI = glob_wildcards("loci/{locus}.fasta").locus

rule all:
    input:
        "results/species_tree.nwk",
        "results/concordance.cf.tree",
        "results/dated_tree.nex"

rule align:
    input:  "loci/{locus}.fasta"
    output: "aligned/{locus}.aln"
    threads: 4
    conda:  "envs/phylo.yaml"
    shell:  "mafft --auto --thread {threads} {input} > {output}"

rule trim:
    input:  "aligned/{locus}.aln"
    output: "trimmed/{locus}.trim"
    conda:  "envs/phylo.yaml"
    shell:  "trimal -in {input} -out {output} -automated1"

rule gene_tree:
    input:  "trimmed/{locus}.trim"
    output: "gene_trees/{locus}.treefile"
    threads: 4
    conda:  "envs/phylo.yaml"
    shell:
        "iqtree2 -s {input} -m MFP -bb 1000 "
        "-T {threads} --prefix gene_trees/{wildcards.locus}"

rule collect_gene_trees:
    input:  expand("gene_trees/{locus}.treefile", locus=LOCI)
    output: "results/all_gene_trees.nwk"
    shell:  "cat {input} > {output}"

rule astral:
    input:  "results/all_gene_trees.nwk"
    output: "results/species_tree.nwk"
    conda:  "envs/phylo.yaml"
    shell:
        "java -jar $CONDA_PREFIX/share/astral/astral.jar "
        "-i {input} -o {output} -t 2"

rule concatenate:
    input:  expand("trimmed/{locus}.trim", locus=LOCI)
    output: "results/concat.phy"
    conda:  "envs/phylo.yaml"
    script: "scripts/concatenate.py"

rule concordance:
    input:
        tree="results/species_tree.nwk",
        genes="results/all_gene_trees.nwk",
        aln="results/concat.phy"
    output: "results/concordance.cf.tree"
    threads: 8
    conda:  "envs/phylo.yaml"
    shell:
        "iqtree2 -t {input.tree} --gcf {input.genes} "
        "-s {input.aln} --scf 100 "
        "-T {threads} --prefix results/concordance"

rule dating:
    input:
        tree="results/species_tree.nwk",
        aln="results/concat.phy"
    output: "results/dated_tree.nex"
    threads: 16
    conda:  "envs/phylo.yaml"
    shell:
        "mcmctree mcmctree.ctl"
Download:
Bash — Running Snakemake on HPC
# Local run
snakemake --cores 32 --use-conda

# SLURM cluster execution
snakemake --cores 500 --use-conda \
  --cluster "sbatch --cpus-per-task={threads} --mem=32G \
  --time=24:00:00 --output=logs/{rule}_{wildcards}.log" \
  --jobs 50

# With Snakemake profiles (recommended for HPC):
snakemake --profile slurm_profile/

# Dry run (check what would be executed)
snakemake -n

# Visualize DAG
snakemake --dag | dot -Tpng > pipeline_dag.png

Nextflow Alternative

Nextflow
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.loci = "loci/*.fasta"
params.outdir = "results"

process ALIGN {
    conda 'envs/phylo.yaml'
    cpus 4
    input: path(locus)
    output: path("${locus.baseName}.aln")
    script: "mafft --auto --thread $task.cpus $locus > ${locus.baseName}.aln"
}

process GENE_TREE {
    conda 'envs/phylo.yaml'
    cpus 4
    input: path(aln)
    output: path("${aln.baseName}.treefile")
    script:
    """
    iqtree2 -s $aln -m MFP -bb 1000 -T $task.cpus \
      --prefix ${aln.baseName}
    """
}

process ASTRAL {
    conda 'envs/phylo.yaml'
    publishDir params.outdir
    input: path(trees)
    output: path("species_tree.nwk")
    script:
    """
    cat $trees > all_trees.nwk
    java -jar \$ASTRAL_JAR -i all_trees.nwk -o species_tree.nwk -t 2
    """
}

workflow {
    loci_ch = Channel.fromPath(params.loci)
    aligned = ALIGN(loci_ch)
    gene_trees = GENE_TREE(aligned)
    ASTRAL(gene_trees.collect())
}
Thumb Rule — Pipeline Choice: Snakemake = Python-native, easier to learn, great for academic labs. Nextflow = Java-based, better for cloud/container environments, preferred by large-scale bioinformatics cores. Both achieve the same goal — pick the one your lab already uses.

10.2 — RevBayes for Custom Evolutionary Models

RevBayes is a programmable Bayesian framework where you specify your own graphical model. Unlike BEAST2 (XML configuration), RevBayes uses the Rev language — a scripting language for probabilistic graphical models.

Rev — Basic Phylogenetic Analysis
# RevBayes: GTR+G model with relaxed clock
# Save as: analysis.Rev

# Read data
data = readDiscreteCharacterData("alignment.nex")
n_taxa = data.ntaxa()
taxa = data.taxa()

# Tree topology + branch lengths (uniform prior on topology)
topology ~ dnUniformTopology(taxa)
for (i in 1:2*n_taxa-3) {
    br_lens[i] ~ dnExponential(10.0)
    moves[i] = mvScale(br_lens[i], weight=1.0)
}
phylogeny := treeAssembly(topology, br_lens)
moves[2*n_taxa-2] = mvNNI(topology, weight=10.0)
moves[2*n_taxa-1] = mvSPR(topology, weight=10.0)

# GTR substitution model
er ~ dnDirichlet(v(1,1,1,1,1,1))
pi ~ dnDirichlet(v(1,1,1,1))
moves[2*n_taxa] = mvBetaSimplex(er, weight=3.0)
moves[2*n_taxa+1] = mvBetaSimplex(pi, weight=3.0)
Q := fnGTR(er, pi)

# Gamma rate variation
alpha ~ dnExponential(1.0)
moves[2*n_taxa+2] = mvScale(alpha, weight=2.0)
site_rates := fnDiscretizeGamma(alpha, alpha, 4)

# Phylogenetic CTMC
seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q,
                    siteRates=site_rates, type="DNA")
seq.clamp(data)

# MCMC
mymodel = model(phylogeny)
monitors[1] = mnModel(filename="output.log", printgen=100)
monitors[2] = mnFile(phylogeny, filename="output.trees", printgen=100)
monitors[3] = mnScreen(printgen=1000)

mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.run(generations=1000000)

# Custom model example: site-heterogeneous mixture
# (RevBayes can define ANY model — this is its superpower)
# n_cats = 4
# for (i in 1:n_cats) {
#     pi_mix[i] ~ dnDirichlet(v(1,1,1,1))
#     Q_mix[i] := fnGTR(er, pi_mix[i])
# }
# mix_weights ~ dnDirichlet(rep(1, n_cats))
# Q_mixture := fnMixture(Q_mix, mix_weights)
Download:
Byte
When to use RevBayes over BEAST2/MrBayes:
  • You need a model not implemented in any GUI tool
  • You want to combine tree inference with custom trait evolution models
  • You need joint estimation of phylogeny + diversification + trait evolution
  • You want full control over the graphical model specification
  • Teaching: RevBayes makes the statistical model explicit and transparent

10.3 — Simulation for Validation

SimulatorWhat It SimulatesUse Case
Seq-GenSequences along a fixed treeTesting tree inference accuracy
INDELibleSequences with indels along a treeAlignment method benchmarking
SimPhyGene trees under multispecies coalescentTesting species tree methods
msprimeCoalescent genealogies + mutationsPopulation genetics simulations
SLiMForward-time evolution (selection, demography)Complex evolutionary scenarios
Bash — Seq-Gen & SimPhy
# Seq-Gen: simulate sequences on a known tree
# Test: can IQ-TREE recover the true tree?
seq-gen -mGTR -l 1000 -n 100 -s 0.5 \
  -f 0.3 0.2 0.2 0.3 \
  -r 1.0 2.0 1.0 1.0 2.0 1.0 \
  -a 0.5 \
  < true_tree.nwk > simulated_alignments.phy
# -m GTR: substitution model
# -l 1000: 1000 sites
# -n 100: 100 replicate datasets
# -s 0.5: tree scale (substitutions per site)
# -a 0.5: alpha parameter for gamma rate variation

# SimPhy: simulate gene trees under the MSC + sequences
simphy -rs 100 -rl f:500 -rg 1 \
  -sp f:20 -sl f:10 -sb f:0.000001 \
  -st f:1000000 -si f:1 \
  -so f:1 -od 1 -op 1 -oc 1 \
  -ol f:500 -om f:1 \
  -v 3 -cs 293745 -o simphy_output
# Then simulate sequences on each gene tree with Seq-Gen or INDELible

# INDELible: simulate with insertions and deletions
# control.txt:
# [TYPE] NUCLEOTIDE 1
# [MODEL] mymodel
#   [submodel] GTR 0.5 2.0 0.5 0.5 2.0
#   [statefreq] 0.3 0.2 0.2 0.3
#   [rates] 0.5 0.2 4
#   [indelmodel] POW 1.7 500
#   [indelrate] 0.1
# [TREE] t1 [rooted] ((A:0.1,B:0.2):0.3,(C:0.15,D:0.25):0.05);
# [PARTITIONS] part1 [t1 mymodel 1000]
# [EVOLVE] part1 100 sim_output
indelible
Download:
Python — msprime Coalescent Simulation
import msprime
import tskit

# Simulate a species tree with gene trees under the MSC
# Two populations that diverged 100,000 generations ago
demography = msprime.Demography()
demography.add_population(name="Pop_A", initial_size=10000)
demography.add_population(name="Pop_B", initial_size=10000)
demography.add_population(name="Ancestor", initial_size=20000)
demography.add_population_split(
    time=100000, derived=["Pop_A", "Pop_B"], ancestral="Ancestor"
)

# Simulate 100 independent loci (gene trees)
ts = msprime.sim_ancestry(
    samples={"Pop_A": 10, "Pop_B": 10},
    demography=demography,
    sequence_length=10000,
    num_replicates=100,
    recombination_rate=1e-8,
    random_seed=42
)

# Add mutations to each tree sequence
for i, tree_seq in enumerate(ts):
    mts = msprime.sim_mutations(
        tree_seq, rate=1e-8, random_seed=42+i
    )
    # Export as VCF, FASTA, or tree sequence
    with open(f"locus_{i}.vcf", "w") as f:
        mts.write_vcf(f)

# Analyze tree sequences directly with tskit
# (much faster than sequence-based analysis)
ts_single = msprime.sim_ancestry(
    samples={"Pop_A": 50, "Pop_B": 50},
    demography=demography,
    sequence_length=1e7,
    recombination_rate=1e-8
)
mts = msprime.sim_mutations(ts_single, rate=1e-8)

# Compute diversity directly from tree sequence
diversity_A = mts.diversity(sample_sets=[
    [s.id for s in mts.samples() if mts.node(s.id).population == 0]
])
print(f"Pi (Pop A): {diversity_A}")
Download:
Thumb Rule — Validate Before Publish: Before claiming any phylogenomic result, simulate data under the null hypothesis and your alternative hypothesis. Can your method distinguish them? If not, your result may be an artifact. Simulations are the cheapest experiment you can run.

10.4 — Tree Visualization & Figure Production

R — ggtree Publication Figures
library(ggtree)
library(treeio)
library(ggplot2)
library(ggtreeExtra)

# Basic tree with bootstrap support
tree <- read.tree("species_tree.nwk")
p <- ggtree(tree) +
  geom_tiplab(size=3) +
  geom_nodelab(aes(label=label), size=2, hjust=-0.1) +
  theme_tree2()

# Dated tree with 95% HPD node bars
beast_tree <- read.beast("dated_tree.nex")
p_dated <- ggtree(beast_tree, mrsd="2024-01-01") +
  geom_range("height_0.95_HPD", color="steelblue", alpha=0.4, size=2) +
  geom_tiplab(size=2.5) +
  scale_x_continuous(labels=abs) +
  xlab("Time (Ma)") +
  theme_tree2()

# Annotate with trait data (heatmap next to tree)
trait_data <- read.csv("traits.csv", row.names=1)
p_heat <- gheatmap(p, trait_data, offset=0.05, width=0.3,
                    colnames_angle=45, colnames_offset_y=0.25) +
  scale_fill_viridis_d() +
  theme(legend.position="right")

# Multi-panel phylogenomic figure
# Panel 1: Species tree with support
# Panel 2: Concordance factors
# Panel 3: Divergence times
library(patchwork)
fig <- p_support + p_concordance + p_dated +
  plot_layout(nrow=1, widths=c(1, 1, 1.5)) +
  plot_annotation(tag_levels='A')
ggsave("phylogenomic_figure.pdf", fig, width=18, height=8)

# Circular tree with rate heatmap
p_circ <- ggtree(beast_tree, layout="circular") +
  geom_tiplab2(size=2) +
  geom_treescale()

# Co-phylogeny (tanglegram)
library(phytools)
cophylo_obj <- cophylo(host_tree, parasite_tree, assoc_matrix)
plot(cophylo_obj, link.type="curved", link.col="gray50",
     fsize=0.7, pts=FALSE)

# Export for iTOL (online visualization)
# Upload tree to https://itol.embl.de/
# Add datasets as annotation files:
# DATASET_COLORSTRIP, DATASET_HEATMAP, DATASET_BINARY, etc.
Download:
Python — toytree
import toytree
import toyplot
import numpy as np

# Load tree
tree = toytree.tree("species_tree.nwk")

# Basic plot with support values
canvas, axes, mark = tree.draw(
    width=400, height=600,
    tip_labels_align=True,
    node_labels="support",
    node_sizes=8,
    node_colors=toytree.colors[2]
)

# Color branches by a trait
trait_colors = {
    "SpeciesA": "red", "SpeciesB": "red",
    "SpeciesC": "blue", "SpeciesD": "blue",
}
edge_colors = []
for node in tree.treenode.traverse():
    if node.is_leaf():
        edge_colors.append(trait_colors.get(node.name, "black"))
    else:
        edge_colors.append("black")

canvas, axes, mark = tree.draw(
    edge_colors=edge_colors,
    edge_widths=2,
    tip_labels_align=True
)

# Save as SVG/PDF
import toyplot.svg
toyplot.svg.render(canvas, "tree_figure.svg")
Download:
ToolLanguageBest ForOutput
ggtreeRPublication figures, complex annotationsPDF/SVG/PNG
iTOLWebInteractive exploration, large treesSVG/PDF/PNG
FigTreeJava GUIQuick viewing, BEAST2 outputSVG/PDF/PNG
toytreePythonPythonic workflow, Jupyter notebooksSVG/HTML
DendroscopeJava GUIVery large trees (100k+ tips), networksSVG/PDF
phytoolsRTrait mapping, ancestral states on treesPDF/PNG
Thumb Rule

Figure production workflow

Use ggtree or toytree for the base tree layout → annotate programmatically → export as SVG → final touch-up in Inkscape/Illustrator if needed. Never manually draw phylogenies — always generate them from the actual tree file for reproducibility.

AI Prompt Guide

Phylogenomic Pipeline & Visualization
I need to build a complete, reproducible phylogenomic pipeline for [NUMBER] loci from [NUMBER] taxa. The pipeline should go from raw FASTA loci to: 1. Alignment (MAFFT) → trimming (trimAl) → gene trees (IQ-TREE) 2. Species tree (ASTRAL) + concatenation (IQ-TREE partitioned) 3. Concordance factors (gCF, sCF) 4. Divergence dating (MCMCTree with [NUMBER] fossil calibrations) 5. Ancestral state reconstruction for [TRAIT] Please write: 1. A complete Snakemake pipeline with conda environments 2. SLURM profile for HPC execution 3. Simulation validation script (SimPhy + Seq-Gen) to test pipeline accuracy 4. ggtree R script for a multi-panel publication figure showing: - Species tree with bootstrap + concordance factors - Dated tree with 95% HPD bars + geological timescale - Trait heatmap alongside the tree 5. A README with setup instructions

Troubleshooting

Snakemake: "MissingInputException" or jobs fail silently on cluster

Common HPC issues: (1) Check cluster logs (not just Snakemake output), (2) Ensure conda environments are created before submission (snakemake --use-conda --conda-create-envs-only), (3) Set --latency-wait 60 for NFS file systems with slow metadata, (4) Use --rerun-incomplete to retry failed jobs.

RevBayes: Syntax errors in Rev scripts

Rev language quirks: (1) Use := for deterministic variables, ~ for stochastic, = for constants, (2) Indexing starts at 1 (not 0), (3) Every move needs a weight, (4) Monitor all parameters you care about. Consult RevBayes tutorials at revbayes.github.io.

ggtree: Tree tips don't match data rows

The most common ggtree error. Ensure: (1) Row names in your data frame exactly match tip labels in the tree, (2) No extra whitespace or special characters, (3) Use tree$tip.label to check. Use drop.tip() to remove tips not in your data or add missing rows with NA.

msprime: "DemographyError: population split time must be > 0"

All times in msprime are in generations (not years). If your split time is 100,000 years and generation time is 25 years, use time=4000 generations. Also ensure derived populations exist before their split time and initial sizes are positive.

Mnemonics & Thumb Rules

Mnemonic

"Simulate, Analyze, Compare — SAC"

The validation cycle: Simulate data under known conditions → Analyze with your pipeline → Compare inferred results to known truth. If your pipeline can't recover known truths from simulations, it won't recover truths from real data either.

Mnemonic

"ggtree for Glamour, iTOL for Interactive"

Use ggtree (R) when you need fine-grained control for publication figures. Use iTOL (web) when you need to quickly explore and share annotated trees. Both can produce publication-quality output, but ggtree gives you reproducible R scripts.

Thumb Rule — Reproducibility: Every phylogenomic paper should provide: (1) All alignments and gene trees, (2) The species tree file, (3) The pipeline code (Snakemake/Nextflow), (4) Conda environment YAML, (5) All configuration files (BEAST2 XML, PAML ctl). Upload to Dryad, Zenodo, or GitHub.
Thumb Rule — msprime vs SLiM: Use msprime for demographic simulations (fast, coalescent-based). Use SLiM for selection simulations (forward-time, can model complex selection). For large neutral simulations with selection at specific sites, use SLiM with tree-sequence recording → msprime recap.