Identify genes under positive (diversifying) selection across E. coli and Shigella genomes — from OrthoFinder orthogroups to codon alignments, CodeML site model tests, and likelihood ratio statistics.
The ratio of nonsynonymous (dN) to synonymous (dS) substitutions — the ω (omega) ratio — is the gold standard for measuring selection at the molecular level:
| ω (dN/dS) | Interpretation | Biological meaning |
|---|---|---|
| ω < 1 | Purifying (negative) selection | Amino acid changes are harmful; protein function conserved |
| ω = 1 | Neutral evolution | Amino acid changes have no fitness effect |
| ω > 1 | Positive (diversifying) selection | Amino acid changes are beneficial; protein adapting |
The PAML/CodeML workflow requires running one analysis per orthogroup. Keep alignments, trees, and results in parallel numbered directories for easy batch processing and comparison.
parallel locally or SLURM array jobs on an HPC cluster.Five enterobacterial genomes spanning E. coli diversity plus the closely related Shigella flexneri as an outgroup:
| Species / Strain | NCBI Accession | Source |
|---|---|---|
| Shigella flexneri 2a 301 | GCF_000007405.1 | NCBI RefSeq |
| E. coli FORC_036 (food pathogen) | GCF_001721125.1 | NCBI RefSeq |
| E. coli UMEA 3173-1 | GCF_003697165.2 | NCBI RefSeq |
| E. coli ATCC 35401 | GCF_013357365.1 | NCBI RefSeq |
| E. coli K12 MG1655 | GCF_000005845.2 | NCBI RefSeq |
mkdir -p 20_TestDataset && cd 20_TestDataset # Shigella flexneri wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.gff.gz # E. coli FORC_036 wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.gff.gz # E. coli UMEA 3173-1 wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.gff.gz # E. coli ATCC 35401 wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.gff.gz # E. coli K12 MG1655 wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz gzip -d *.gz echo "Genomes ready:" ls *.fna | wc -l
| 🌍wget NCBI FTP URL | NCBI FTP paths follow a predictable pattern: /genomes/all/GCF/[3-digit split]/[accession]/[version]/. Download both the _genomic.fna.gz (genome) and _genomic.gff.gz (annotation) — you need both for CDS extraction with gffread. |
| 🛄gzip -d *.gz | Decompress all downloaded files at once. For CodeML analysis, you need uncompressed FASTA and GFF files. gzip preserves the original file date (useful for tracking downloads). |
| 🧠5 species selection | The species set is chosen to have enough phylogenetic diversity to detect positive selection (multiple synonymous substitutions per codon) but not so diverged that codon alignments break down. Shigella serves as the outgroup. |
Always extract your own proteins and CDS from GFF — this guarantees consistent IDs between nucleotide and protein sequences, which is critical for codon alignments:
module load cufflinks/2.2.1 # provides gffread
# or: conda install -c bioconda gffread
mkdir -p proteins cds
for GFF in 20_TestDataset/*.gff; do
BASE=$(basename ${GFF} _genomic.gff)
FNA="${GFF//_genomic.gff/_genomic.fna}"
echo "Extracting: ${BASE}"
# Extract protein sequences
gffread ${GFF} -g ${FNA} \
-y proteins/${BASE}.prot.fa
# Extract CDS nucleotide sequences
gffread ${GFF} -g ${FNA} \
-x cds/${BASE}.cds.fa
echo " Proteins: $(grep -c '>' proteins/${BASE}.prot.fa)"
echo " CDS: $(grep -c '>' cds/${BASE}.cds.fa)"
done
| 📄gffread GFF -g FNA -y proteins.fa | Extracts protein sequences using the GFF annotation and genome FASTA. -y outputs translated protein sequences. gffread uses the CDS features in the GFF to find each gene's exons, join them, and translate. |
| 📊gffread GFF -g FNA -x cds.fa | -x outputs the spliced CDS nucleotide sequences. These are the raw codons before translation. Critical for pal2nal — you need both the protein alignment AND its matching nucleotide sequences. |
| ⚠Consistent IDs matter | By extracting both proteins and CDS from the same GFF with the same tool, every protein ID has an exact matching CDS ID. This 1:1 correspondence is required by pal2nal. Mixing databases (e.g. NCBI protein + Ensembl CDS) often breaks at this step due to ID mismatches. |
module load orthofinder/2.5.2
module load diamond/2.0.4
# Run OrthoFinder on all 5 protein sets
orthofinder \
-f proteins/ \
-t 16 \
-S diamond \
-o orthofinder_results/
# Get the list of orthogroups with exactly 1 gene per species
# (single-copy orthologs — cleanest signal for selection analysis)
SC_FILE="orthofinder_results/Results_*/Orthogroups/Orthogroups_SingleCopyOrthologues.txt"
echo "Single-copy orthogroups: $(cat ${SC_FILE} | wc -l)"
# For each single-copy orthogroup, extract:
# 1. The protein sequences (for alignment)
# 2. The CDS sequences (for codon alignment with pal2nal)
mkdir -p sc_orthologs/{proteins,cds}
OG_DIR="orthofinder_results/Results_*/Orthogroup_Sequences/"
while read OG; do
# Copy orthogroup protein FASTA
cp ${OG_DIR}/${OG}.fa sc_orthologs/proteins/${OG}.fa
# Build matching CDS FASTA (same IDs, nucleotide sequences)
python3 << PYEOF
import re
from Bio import SeqIO
og = "${OG}"
cds_seqs = {}
for cds_file in ["cds/GCF_000007405.1.cds.fa",
"cds/GCF_001721125.1.cds.fa",
"cds/GCF_003697165.2.cds.fa",
"cds/GCF_013357365.1.cds.fa",
"cds/GCF_000005845.2.cds.fa"]:
for rec in SeqIO.parse(cds_file, "fasta"):
cds_seqs[rec.id] = str(rec.seq)
with open(f"sc_orthologs/cds/{og}.cds.fa", "w") as out:
for rec in SeqIO.parse(f"sc_orthologs/proteins/{og}.fa", "fasta"):
if rec.id in cds_seqs:
out.write(f">{rec.id}\n{cds_seqs[rec.id]}\n")
PYEOF
done < ${SC_FILE}
CodeML requires codon alignments — nucleotide alignments guided by protein alignments to preserve reading frames:
module load clustalo/1.2.4
module load pal2nal.v14
mkdir -p alignments/{protein_aln,codon_aln}
for PROT in sc_orthologs/proteins/OG*.fa; do
OG=$(basename ${PROT} .fa)
CDS="sc_orthologs/cds/${OG}.cds.fa"
# Step 1: Align proteins with clustalo
clustalo \
-i ${PROT} \
-o alignments/protein_aln/${OG}.aln.fa \
--outfmt=fa \
--threads=4
# Step 2: Map protein alignment onto CDS to get codon alignment
# pal2nal ensures no partial codons and removes stop codons
pal2nal.pl \
alignments/protein_aln/${OG}.aln.fa \
${CDS} \
-output paml \
> alignments/codon_aln/${OG}.codon.phy
# -output paml : PHYLIP format required by CodeML
# Removes stop codons and mismatches automatically
if [ $? -eq 0 ]; then
echo "OK: ${OG}"
else
echo "FAILED: ${OG}" >&2
fi
done
echo "Codon alignments created: $(ls alignments/codon_aln/*.phy | wc -l)"
| 📊clustalo -i proteins -o aln.fa | Clustal Omega aligns protein sequences using a guide tree and HMM profiles. Protein alignment is more reliable than nucleotide alignment because amino acid substitution matrices (BLOSUM62) capture evolutionary signal better than nucleotide models. |
| 🔗pal2nal.pl aln.fa cds.fa -output paml | pal2nal maps a protein alignment back onto nucleotide CDS sequences codon-by-codon. Each aligned amino acid becomes a 3-nucleotide codon block. This preserves the reading frame that CodeML needs. -output paml writes PHYLIP format required by CodeML. |
| ❌stop codon removal | pal2nal automatically removes stop codons (TGA, TAA, TAG) from the alignment. CodeML cannot handle stop codons in the alignment and will crash or give nonsense results if they are present. |
module load iqtree/2.2.0
mkdir -p trees
for ALN in alignments/protein_aln/OG*.aln.fa; do
OG=$(basename ${ALN} .aln.fa)
iqtree2 \
-s ${ALN} \
-m LG+G4 \
-T 4 \
-pre trees/${OG} \
--quiet
# Rename tree file to match OG name
# IQ-TREE produces ${OG}.treefile
done
echo "Trees built: $(ls trees/*.treefile | wc -l)"
# Verify tree format (CodeML needs Newick with no branch lengths for site models)
head -1 trees/OG0000001.treefile
| 📊-m LG+G4 | Use the LG amino acid substitution model with a Gamma rate distribution (4 categories). LG is a modern, empirical protein substitution matrix. +G4 accounts for rate variation across sites. Use -m TEST to let IQ-TREE select the best model automatically (slower but more rigorous). |
| 🧰-T 4 | Use 4 threads per tree. For genome-wide analysis with thousands of small orthogroups, it is faster to run many trees at 1 thread each (array jobs) rather than few trees at many threads. |
| 📄-pre trees/OG | Prefix for all output files. IQ-TREE creates OG.treefile (best tree), OG.iqtree (full log), and OG.log. Only .treefile is needed by CodeML. |
The likelihood ratio test (LRT) between Model 7 (beta distribution, ω ≤ 1) and Model 8 (beta + extra class allowing ω > 1) is the standard test for positive selection:
module load paml/4.9h
mkdir -p codeml_results/{M7,M8}
# Generate CodeML control files and run both models per orthogroup
for ALN in alignments/codon_aln/OG*.codon.phy; do
OG=$(basename ${ALN} .codon.phy)
TREE="trees/${OG}.treefile"
[ ! -f "${TREE}" ] && continue
# Model 7: beta distribution (null model, no positive selection)
mkdir -p codeml_results/M7/${OG}
cat > codeml_results/M7/${OG}/codeml.ctl << EOF
seqfile = ../../../${ALN}
treefile = ../../../${TREE}
outfile = out_M7.txt
noisy = 0
verbose = 0
runmode = 0
seqtype = 1
CodonFreq = 2
ndata = 1
clock = 0
model = 0
NSsites = 7
icode = 0
fix_kappa = 0
kappa = 2
fix_omega = 0
omega = 0.4
ncatG = 10
getSE = 0
RateAncestor = 0
Small_Diff = 5e-7
cleandata = 1
EOF
cd codeml_results/M7/${OG}
codeml codeml.ctl > run.log 2>&1
cd ../../..
# Model 8: beta + omega class (alternative model, positive selection allowed)
mkdir -p codeml_results/M8/${OG}
cp codeml_results/M7/${OG}/codeml.ctl codeml_results/M8/${OG}/codeml.ctl
sed -i 's/NSsites = 7/NSsites = 8/' codeml_results/M8/${OG}/codeml.ctl
sed -i 's/outfile = out_M7.txt/outfile = out_M8.txt/' codeml_results/M8/${OG}/codeml.ctl
cd codeml_results/M8/${OG}
codeml codeml.ctl > run.log 2>&1
cd ../../..
echo "Done: ${OG}"
done
| ⚙seqtype = 1 | Tells CodeML the input is codon data (1=codons, 2=amino acids). Must be 1 for dN/dS analysis. With amino acid data you cannot compute dN/dS. |
| 🧬CodonFreq = 2 | Codon frequency model. 0=equal, 1=F1x4, 2=F3x4, 3=codon table. F3x4 estimates nucleotide frequencies at each codon position independently — a good balance of accuracy and parameter count. |
| 🎯NSsites = 7 / 8 | The site model to run. M7 (NSsites=7): beta distribution of omega values, all ≤1 (null — no positive selection). M8 (NSsites=8): beta distribution PLUS an extra class with omega >1 (alternative — positive selection allowed). The LRT compares their log-likelihoods. |
| ✅cleandata = 1 | Removes sites with gaps or ambiguous nucleotides before analysis. Safer for automated pipelines where some alignments may have problem columns. |
| 🔄sed -i NSsites = 7 to 8 | Reuses the M7 control file for M8 by swapping just the model number. This ensures both models use identical data, tree, and settings — essential for a valid likelihood ratio test. |
#!/usr/bin/env python3
"""Parse CodeML M7/M8 results and perform likelihood ratio test."""
import re, os
from scipy import stats
import pandas as pd
def parse_lnL(codeml_outfile):
"""Extract log-likelihood from CodeML output file."""
with open(codeml_outfile) as f:
content = f.read()
match = re.search(r"lnL\(ntime:.*\):\s+([-\d.]+)", content)
if match:
return float(match.group(1))
return None
def parse_omega_M8(codeml_outfile):
"""Extract omega estimate for the positive selection class from M8."""
with open(codeml_outfile) as f:
content = f.read()
match = re.search(r"p=\s+[\d.]+\s+q=\s+[\d.]+\n.+w=\s+([\d.]+)", content)
if match:
return float(match.group(1))
return None
results = []
base = "codeml_results"
for og in os.listdir(f"{base}/M7"):
m7_out = f"{base}/M7/{og}/out_M7.txt"
m8_out = f"{base}/M8/{og}/out_M8.txt"
if not (os.path.exists(m7_out) and os.path.exists(m8_out)):
continue
lnL_M7 = parse_lnL(m7_out)
lnL_M8 = parse_lnL(m8_out)
omega_M8 = parse_omega_M8(m8_out)
if lnL_M7 is None or lnL_M8 is None:
continue
# Likelihood ratio statistic: 2*(lnL_M8 - lnL_M7)
# Degrees of freedom = 2 (M8 has 2 extra params: p and omega)
LRT = 2 * (lnL_M8 - lnL_M7)
pval = stats.chi2.sf(LRT, df=2)
results.append({
"OG": og,
"lnL_M7": lnL_M7,
"lnL_M8": lnL_M8,
"LRT": LRT,
"pvalue": pval,
"omega_positive_class": omega_M8
})
df = pd.DataFrame(results)
# Multiple testing correction (Bonferroni or BH-FDR)
from statsmodels.stats.multitest import multipletests
_, fdr, _, _ = multipletests(df["pvalue"], method="fdr_bh")
df["FDR"] = fdr
# Sort by FDR
df = df.sort_values("FDR")
print(f"Orthogroups tested: {len(df)}")
print(f"Under positive selection (FDR < 0.05): {(df.FDR < 0.05).sum()}")
print(f"\nTop positively selected genes:")
print(df[df.FDR < 0.05][["OG","LRT","pvalue","FDR","omega_positive_class"]].head(20))
df.to_csv("positive_selection_results.csv", index=False)
print("\nSaved: positive_selection_results.csv")
| Column | Description |
|---|---|
lnL_M7 | Log-likelihood under null model (no positive selection allowed) |
lnL_M8 | Log-likelihood under alternative model (positive selection allowed) |
LRT | Likelihood ratio statistic = 2×(lnL_M8 − lnL_M7). Chi-squared distributed with 2 df. |
pvalue | Raw p-value from chi-squared test |
FDR | Benjamini–Hochberg corrected false discovery rate |
omega_positive_class | ω estimated for the positively selected site class (should be >1) |
Check the run.log for each orthogroup. Common causes: (1) Alignment contains stop codons — pal2nal should remove these with -nogap option. (2) Alignment has only 2 sequences — CodeML site models need at least 3. (3) Control file path errors — CodeML uses relative paths; always cd into the directory before running.
M8 lnL < M7 lnL is numerically possible when the optimizer gets stuck in a local maximum. Set LRT to 0 for these genes (they show no evidence of positive selection). Increase optimization iterations in the control file: add iterations = 2000.
Protein and CDS sequences have different lengths or out-of-frame sequences. This happens when gffread extracts incomplete CDS due to annotation errors. Filter out orthogroups where protein length × 3 ≠ CDS length − 3 (stop codon). Add -nogap to pal2nal to skip problematic positions.