Detecting Positive Selection with PAML / CodeML

Identify genes under positive (diversifying) selection across E. coli and Shigella genomes — from OrthoFinder orthogroups to codon alignments, CodeML site model tests, and likelihood ratio statistics.

~150 min Advanced Bash + R
Byte

Overview & Theory

The ratio of nonsynonymous (dN) to synonymous (dS) substitutions — the ω (omega) ratio — is the gold standard for measuring selection at the molecular level:

ω (dN/dS)InterpretationBiological meaning
ω < 1Purifying (negative) selectionAmino acid changes are harmful; protein function conserved
ω = 1Neutral evolutionAmino acid changes have no fitness effect
ω > 1Positive (diversifying) selectionAmino acid changes are beneficial; protein adapting
Byte
Why CodeML site models? Even a positively selected gene will have most codons under purifying selection. A gene-wide ω > 1 is rare. CodeML's site models (M7 vs M8) test whether a subset of codons across the gene has ω > 1, which is far more sensitive. This tutorial is modelled after the analysis in Petersen & Bhatt (2007) identifying positively selected E. coli genes.

Ideal Directory Structure

The PAML/CodeML workflow requires running one analysis per orthogroup. Keep alignments, trees, and results in parallel numbered directories for easy batch processing and comparison.

positive_selection_ecoli/
positive_selection_ecoli/ ├── 20_TestDataset/ # raw genome FASTA + GFF files ├── proteins/ # extracted protein FASTA per species ├── cds/ # CDS nucleotide FASTA per species ├── 1_orthofinder/ # OrthoFinder single-copy orthologs │ └── SingleCopyOrthogroups/ ├── 2_codon_alignments/ # pal2nal codon-aware alignments │ ├── OG0000001.codon.phy # PHYLIP format for CodeML │ └── ... (one per orthogroup) ├── 3_species_tree/ # newick species tree for CodeML │ └── species.nwk ├── 4_codeml/ # CodeML runs (one folder per orthogroup) │ ├── OG0000001/ │ │ ├── M7.ctl # null model control file │ │ ├── M8.ctl # alternative model control file │ │ └── mlc # CodeML likelihood output │ └── ... (one per orthogroup) └── lrt_results.tsv # likelihood ratio test results for all genes
⏱️
Scale warning: For genome-wide analysis (3,000+ orthogroups), run CodeML as an array job — one job per orthogroup. Each CodeML run takes 30 seconds to 10 minutes depending on orthogroup size. Use parallel locally or SLURM array jobs on an HPC cluster.

Dataset

Five enterobacterial genomes spanning E. coli diversity plus the closely related Shigella flexneri as an outgroup:

Species / StrainNCBI AccessionSource
Shigella flexneri 2a 301GCF_000007405.1NCBI RefSeq
E. coli FORC_036 (food pathogen)GCF_001721125.1NCBI RefSeq
E. coli UMEA 3173-1GCF_003697165.2NCBI RefSeq
E. coli ATCC 35401GCF_013357365.1NCBI RefSeq
E. coli K12 MG1655GCF_000005845.2NCBI RefSeq

Step 1 — Download Genomes and Annotations

Bash — Download 5 genomes from NCBI
mkdir -p 20_TestDataset && cd 20_TestDataset

# Shigella flexneri
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.gff.gz

# E. coli FORC_036
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.gff.gz

# E. coli UMEA 3173-1
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.gff.gz

# E. coli ATCC 35401
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.gff.gz

# E. coli K12 MG1655
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz

gzip -d *.gz
echo "Genomes ready:"
ls *.fna | wc -l
Download:
Download and prep commands decoded
🌍wget NCBI FTP URLNCBI FTP paths follow a predictable pattern: /genomes/all/GCF/[3-digit split]/[accession]/[version]/. Download both the _genomic.fna.gz (genome) and _genomic.gff.gz (annotation) — you need both for CDS extraction with gffread.
🛄gzip -d *.gzDecompress all downloaded files at once. For CodeML analysis, you need uncompressed FASTA and GFF files. gzip preserves the original file date (useful for tracking downloads).
🧠5 species selectionThe species set is chosen to have enough phylogenetic diversity to detect positive selection (multiple synonymous substitutions per codon) but not so diverged that codon alignments break down. Shigella serves as the outgroup.

Step 2 — Extract CDS Nucleotide and Protein Sequences

Always extract your own proteins and CDS from GFF — this guarantees consistent IDs between nucleotide and protein sequences, which is critical for codon alignments:

Bash — Extract proteins and CDS with gffread
module load cufflinks/2.2.1   # provides gffread
# or: conda install -c bioconda gffread

mkdir -p proteins cds

for GFF in 20_TestDataset/*.gff; do
    BASE=$(basename ${GFF} _genomic.gff)
    FNA="${GFF//_genomic.gff/_genomic.fna}"

    echo "Extracting: ${BASE}"

    # Extract protein sequences
    gffread ${GFF} -g ${FNA} \
        -y proteins/${BASE}.prot.fa

    # Extract CDS nucleotide sequences
    gffread ${GFF} -g ${FNA} \
        -x cds/${BASE}.cds.fa

    echo "  Proteins: $(grep -c '>' proteins/${BASE}.prot.fa)"
    echo "  CDS: $(grep -c '>' cds/${BASE}.cds.fa)"
done
Download:
gffread extraction decoded
📄gffread GFF -g FNA -y proteins.faExtracts protein sequences using the GFF annotation and genome FASTA. -y outputs translated protein sequences. gffread uses the CDS features in the GFF to find each gene's exons, join them, and translate.
📊gffread GFF -g FNA -x cds.fa-x outputs the spliced CDS nucleotide sequences. These are the raw codons before translation. Critical for pal2nal — you need both the protein alignment AND its matching nucleotide sequences.
Consistent IDs matterBy extracting both proteins and CDS from the same GFF with the same tool, every protein ID has an exact matching CDS ID. This 1:1 correspondence is required by pal2nal. Mixing databases (e.g. NCBI protein + Ensembl CDS) often breaks at this step due to ID mismatches.

Step 3 — OrthoFinder Single-Copy Orthogroups

Bash — OrthoFinder + extract single-copy orthologs
module load orthofinder/2.5.2
module load diamond/2.0.4

# Run OrthoFinder on all 5 protein sets
orthofinder \
    -f proteins/ \
    -t 16 \
    -S diamond \
    -o orthofinder_results/

# Get the list of orthogroups with exactly 1 gene per species
# (single-copy orthologs — cleanest signal for selection analysis)
SC_FILE="orthofinder_results/Results_*/Orthogroups/Orthogroups_SingleCopyOrthologues.txt"
echo "Single-copy orthogroups: $(cat ${SC_FILE} | wc -l)"

# For each single-copy orthogroup, extract:
# 1. The protein sequences (for alignment)
# 2. The CDS sequences (for codon alignment with pal2nal)
mkdir -p sc_orthologs/{proteins,cds}

OG_DIR="orthofinder_results/Results_*/Orthogroup_Sequences/"
while read OG; do
    # Copy orthogroup protein FASTA
    cp ${OG_DIR}/${OG}.fa sc_orthologs/proteins/${OG}.fa

    # Build matching CDS FASTA (same IDs, nucleotide sequences)
    python3 << PYEOF
import re
from Bio import SeqIO

og = "${OG}"
cds_seqs = {}
for cds_file in ["cds/GCF_000007405.1.cds.fa",
                  "cds/GCF_001721125.1.cds.fa",
                  "cds/GCF_003697165.2.cds.fa",
                  "cds/GCF_013357365.1.cds.fa",
                  "cds/GCF_000005845.2.cds.fa"]:
    for rec in SeqIO.parse(cds_file, "fasta"):
        cds_seqs[rec.id] = str(rec.seq)

with open(f"sc_orthologs/cds/{og}.cds.fa", "w") as out:
    for rec in SeqIO.parse(f"sc_orthologs/proteins/{og}.fa", "fasta"):
        if rec.id in cds_seqs:
            out.write(f">{rec.id}\n{cds_seqs[rec.id]}\n")
PYEOF
done < ${SC_FILE}
Download:

Step 4 — Codon-Aware Alignments with clustalo + pal2nal

CodeML requires codon alignments — nucleotide alignments guided by protein alignments to preserve reading frames:

Bash — Protein alignment then pal2nal codon alignment
module load clustalo/1.2.4
module load pal2nal.v14

mkdir -p alignments/{protein_aln,codon_aln}

for PROT in sc_orthologs/proteins/OG*.fa; do
    OG=$(basename ${PROT} .fa)
    CDS="sc_orthologs/cds/${OG}.cds.fa"

    # Step 1: Align proteins with clustalo
    clustalo \
        -i ${PROT} \
        -o alignments/protein_aln/${OG}.aln.fa \
        --outfmt=fa \
        --threads=4

    # Step 2: Map protein alignment onto CDS to get codon alignment
    # pal2nal ensures no partial codons and removes stop codons
    pal2nal.pl \
        alignments/protein_aln/${OG}.aln.fa \
        ${CDS} \
        -output paml \
        > alignments/codon_aln/${OG}.codon.phy

    # -output paml  : PHYLIP format required by CodeML
    # Removes stop codons and mismatches automatically

    if [ $? -eq 0 ]; then
        echo "OK: ${OG}"
    else
        echo "FAILED: ${OG}" >&2
    fi
done

echo "Codon alignments created: $(ls alignments/codon_aln/*.phy | wc -l)"
Download:
clustalo + pal2nal pipeline decoded
📊clustalo -i proteins -o aln.faClustal Omega aligns protein sequences using a guide tree and HMM profiles. Protein alignment is more reliable than nucleotide alignment because amino acid substitution matrices (BLOSUM62) capture evolutionary signal better than nucleotide models.
🔗pal2nal.pl aln.fa cds.fa -output pamlpal2nal maps a protein alignment back onto nucleotide CDS sequences codon-by-codon. Each aligned amino acid becomes a 3-nucleotide codon block. This preserves the reading frame that CodeML needs. -output paml writes PHYLIP format required by CodeML.
stop codon removalpal2nal automatically removes stop codons (TGA, TAA, TAG) from the alignment. CodeML cannot handle stop codons in the alignment and will crash or give nonsense results if they are present.
🧬
Why codon alignment? A simple nucleotide alignment loses the reading frame at every indel. pal2nal ensures that gaps in the protein alignment become 3-nucleotide gaps in the codon alignment, keeping codons intact. CodeML models substitutions codon-by-codon — so a broken reading frame = meaningless dN/dS.

Step 5 — Gene Trees

Bash — IQ-TREE gene trees for CodeML
module load iqtree/2.2.0

mkdir -p trees

for ALN in alignments/protein_aln/OG*.aln.fa; do
    OG=$(basename ${ALN} .aln.fa)

    iqtree2 \
        -s ${ALN} \
        -m LG+G4 \
        -T 4 \
        -pre trees/${OG} \
        --quiet

    # Rename tree file to match OG name
    # IQ-TREE produces ${OG}.treefile
done

echo "Trees built: $(ls trees/*.treefile | wc -l)"

# Verify tree format (CodeML needs Newick with no branch lengths for site models)
head -1 trees/OG0000001.treefile
Download:
IQ-TREE gene tree flags decoded
📊-m LG+G4Use the LG amino acid substitution model with a Gamma rate distribution (4 categories). LG is a modern, empirical protein substitution matrix. +G4 accounts for rate variation across sites. Use -m TEST to let IQ-TREE select the best model automatically (slower but more rigorous).
🧰-T 4Use 4 threads per tree. For genome-wide analysis with thousands of small orthogroups, it is faster to run many trees at 1 thread each (array jobs) rather than few trees at many threads.
📄-pre trees/OGPrefix for all output files. IQ-TREE creates OG.treefile (best tree), OG.iqtree (full log), and OG.log. Only .treefile is needed by CodeML.
🌳
Gene tree vs species tree: CodeML site models use per-gene trees, not the species tree. Gene trees can differ from the species tree due to incomplete lineage sorting, horizontal gene transfer, or duplication. For small datasets (5 species) the difference is usually minor, but for large multi-species analyses always use gene-specific trees.

Step 6 — CodeML Site Models (M7 vs M8)

The likelihood ratio test (LRT) between Model 7 (beta distribution, ω ≤ 1) and Model 8 (beta + extra class allowing ω > 1) is the standard test for positive selection:

Bash + Python — CodeML batch run M7/M8
module load paml/4.9h

mkdir -p codeml_results/{M7,M8}

# Generate CodeML control files and run both models per orthogroup
for ALN in alignments/codon_aln/OG*.codon.phy; do
    OG=$(basename ${ALN} .codon.phy)
    TREE="trees/${OG}.treefile"
    [ ! -f "${TREE}" ] && continue

    # Model 7: beta distribution (null model, no positive selection)
    mkdir -p codeml_results/M7/${OG}
    cat > codeml_results/M7/${OG}/codeml.ctl << EOF
      seqfile = ../../../${ALN}
      treefile = ../../../${TREE}
      outfile = out_M7.txt
      noisy = 0
      verbose = 0
      runmode = 0
      seqtype = 1
      CodonFreq = 2
      ndata = 1
      clock = 0
      model = 0
      NSsites = 7
      icode = 0
      fix_kappa = 0
      kappa = 2
      fix_omega = 0
      omega = 0.4
      ncatG = 10
      getSE = 0
      RateAncestor = 0
      Small_Diff = 5e-7
      cleandata = 1
EOF

    cd codeml_results/M7/${OG}
    codeml codeml.ctl > run.log 2>&1
    cd ../../..

    # Model 8: beta + omega class (alternative model, positive selection allowed)
    mkdir -p codeml_results/M8/${OG}
    cp codeml_results/M7/${OG}/codeml.ctl codeml_results/M8/${OG}/codeml.ctl
    sed -i 's/NSsites = 7/NSsites = 8/' codeml_results/M8/${OG}/codeml.ctl
    sed -i 's/outfile = out_M7.txt/outfile = out_M8.txt/' codeml_results/M8/${OG}/codeml.ctl

    cd codeml_results/M8/${OG}
    codeml codeml.ctl > run.log 2>&1
    cd ../../..

    echo "Done: ${OG}"
done
Download:
CodeML control file parameters decoded
seqtype = 1Tells CodeML the input is codon data (1=codons, 2=amino acids). Must be 1 for dN/dS analysis. With amino acid data you cannot compute dN/dS.
🧬CodonFreq = 2Codon frequency model. 0=equal, 1=F1x4, 2=F3x4, 3=codon table. F3x4 estimates nucleotide frequencies at each codon position independently — a good balance of accuracy and parameter count.
🎯NSsites = 7 / 8The site model to run. M7 (NSsites=7): beta distribution of omega values, all ≤1 (null — no positive selection). M8 (NSsites=8): beta distribution PLUS an extra class with omega >1 (alternative — positive selection allowed). The LRT compares their log-likelihoods.
cleandata = 1Removes sites with gaps or ambiguous nucleotides before analysis. Safer for automated pipelines where some alignments may have problem columns.
🔄sed -i NSsites = 7 to 8Reuses the M7 control file for M8 by swapping just the model number. This ensures both models use identical data, tree, and settings — essential for a valid likelihood ratio test.
Why M7 vs M8? The LRT statistic = 2 * (lnL_M8 - lnL_M7). Under the null hypothesis (no positive selection), this follows a chi-squared distribution with 2 degrees of freedom. If p < 0.05, we reject the null and conclude some sites are under positive selection. M8 also reports which specific codons are under selection via Bayes Empirical Bayes (BEB) analysis.

Step 7 — Parse Results & Likelihood Ratio Test

Python — Parse CodeML output and compute LRT p-values
#!/usr/bin/env python3
"""Parse CodeML M7/M8 results and perform likelihood ratio test."""

import re, os
from scipy import stats
import pandas as pd

def parse_lnL(codeml_outfile):
    """Extract log-likelihood from CodeML output file."""
    with open(codeml_outfile) as f:
        content = f.read()
    match = re.search(r"lnL\(ntime:.*\):\s+([-\d.]+)", content)
    if match:
        return float(match.group(1))
    return None

def parse_omega_M8(codeml_outfile):
    """Extract omega estimate for the positive selection class from M8."""
    with open(codeml_outfile) as f:
        content = f.read()
    match = re.search(r"p=\s+[\d.]+\s+q=\s+[\d.]+\n.+w=\s+([\d.]+)", content)
    if match:
        return float(match.group(1))
    return None

results = []
base = "codeml_results"

for og in os.listdir(f"{base}/M7"):
    m7_out = f"{base}/M7/{og}/out_M7.txt"
    m8_out = f"{base}/M8/{og}/out_M8.txt"

    if not (os.path.exists(m7_out) and os.path.exists(m8_out)):
        continue

    lnL_M7 = parse_lnL(m7_out)
    lnL_M8 = parse_lnL(m8_out)
    omega_M8 = parse_omega_M8(m8_out)

    if lnL_M7 is None or lnL_M8 is None:
        continue

    # Likelihood ratio statistic: 2*(lnL_M8 - lnL_M7)
    # Degrees of freedom = 2 (M8 has 2 extra params: p and omega)
    LRT = 2 * (lnL_M8 - lnL_M7)
    pval = stats.chi2.sf(LRT, df=2)

    results.append({
        "OG": og,
        "lnL_M7": lnL_M7,
        "lnL_M8": lnL_M8,
        "LRT": LRT,
        "pvalue": pval,
        "omega_positive_class": omega_M8
    })

df = pd.DataFrame(results)

# Multiple testing correction (Bonferroni or BH-FDR)
from statsmodels.stats.multitest import multipletests
_, fdr, _, _ = multipletests(df["pvalue"], method="fdr_bh")
df["FDR"] = fdr

# Sort by FDR
df = df.sort_values("FDR")

print(f"Orthogroups tested: {len(df)}")
print(f"Under positive selection (FDR < 0.05): {(df.FDR < 0.05).sum()}")
print(f"\nTop positively selected genes:")
print(df[df.FDR < 0.05][["OG","LRT","pvalue","FDR","omega_positive_class"]].head(20))

df.to_csv("positive_selection_results.csv", index=False)
print("\nSaved: positive_selection_results.csv")
Download:
ColumnDescription
lnL_M7Log-likelihood under null model (no positive selection allowed)
lnL_M8Log-likelihood under alternative model (positive selection allowed)
LRTLikelihood ratio statistic = 2×(lnL_M8 − lnL_M7). Chi-squared distributed with 2 df.
pvalueRaw p-value from chi-squared test
FDRBenjamini–Hochberg corrected false discovery rate
omega_positive_classω estimated for the positively selected site class (should be >1)
Interpreting results: A gene is under positive selection if FDR < 0.05 AND omega_positive_class > 1. In the E. coli dataset, virulence factors, membrane proteins, and genes involved in host-pathogen interaction typically show the strongest signal. Look up significant genes in EcoCyc or UniProt to understand their function.

Troubleshooting

CodeML: "lnL not found" or blank output files

Check the run.log for each orthogroup. Common causes: (1) Alignment contains stop codons — pal2nal should remove these with -nogap option. (2) Alignment has only 2 sequences — CodeML site models need at least 3. (3) Control file path errors — CodeML uses relative paths; always cd into the directory before running.

LRT < 0 for some orthogroups

M8 lnL < M7 lnL is numerically possible when the optimizer gets stuck in a local maximum. Set LRT to 0 for these genes (they show no evidence of positive selection). Increase optimization iterations in the control file: add iterations = 2000.

pal2nal fails with "frame shift" errors

Protein and CDS sequences have different lengths or out-of-frame sequences. This happens when gffread extracts incomplete CDS due to annotation errors. Filter out orthogroups where protein length × 3 ≠ CDS length − 3 (stop codon). Add -nogap to pal2nal to skip problematic positions.