Evidence-based gene prediction on the Heterodera glycines (soybean cyst nematode) genome — MAKER2 pipeline from transcript/protein evidence to trained AUGUSTUS models and AED score evaluation.
MAKER2 is an evidence-based genome annotation pipeline that combines three types of evidence to predict gene models:
mRNA sequences from your species or close relatives
Protein sequences from related species (Swiss-Prot, clade proteomes)
AUGUSTUS + SNAP trained on high-quality gene models
The recommended approach is iterative: Round 1 uses only transcript/protein evidence → extract confident gene models → train AUGUSTUS and SNAP → Round 2 adds ab initio predictions guided by the trained models.
MAKER2 is run in two rounds; keeping them in separate numbered directories lets you compare gene model quality between rounds and restart either round independently.
The Annotation Edit Distance (AED) measures how well a gene model is supported by external evidence. It ranges from 0 to 1:
| AED Value | Interpretation | Quality |
|---|---|---|
| 0.00 – 0.25 | Gene model perfectly matches evidence | High quality |
| 0.25 – 0.50 | Minor discrepancies with evidence | Acceptable |
| 0.50 – 0.75 | Substantial deviation from evidence | Low quality |
| 0.75 – 1.00 | Gene model contradicts evidence | Very poor |
A good annotation should have >80% of gene models with AED < 0.5. The AED score is embedded in the GFF3 output and can be plotted for quality assessment.
AED score explanation graphic from the Bioinformatics Workbook.
mkdir -p evidence/{transcripts,proteins}
# Download all publicly available EST and protein sequences for Tylenchida
# Sources: WormBase ParaSite (https://parasite.wormbase.org/ftp.html)
# Species: B. xylophilus, D. destructor, G. pallida, G. rostochiensis,
# M. floridensis, M. hapla, M. incognita, etc.
# Example download for a few key species:
cd evidence/proteins
wget https://parasite.wormbase.org/ftp.html # Navigate to download proteins
# Also download SwissProt for broad coverage:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
# Combine all protein evidence
cat *proteins*.fa uniprot_sprot.fasta > all_proteins.fasta
cd ../..
# Download H. glycines EST sequences from NCBI (species-specific)
cd evidence/transcripts
# efetch -db nuccore -query "Heterodera glycines[Organism] AND EST[filter]" \
# -format fasta > Hgly_ests.fasta
cd ../..
# Your genome assembly (output from previous assembly step)
GENOME="genome_assembly.fasta"
echo "Genome: $GENOME"
grep -c ">" $GENOME
| 🚀maker -CTL | Generates the three MAKER2 control files from scratch: maker_opts.ctl (main settings), maker_bopts.ctl (BLAST parameters), and maker_exe.ctl (paths to tools). Edit these before running MAKER2. |
| 🧹RepeatMasker | Masks transposable elements and repetitive sequences in the genome before annotation. MAKER2 runs better with a softmasked genome (lowercase = repeats) because BLAST is not distracted by repetitive regions. |
| 📊cat *.fa > combined | Concatenates all protein evidence files into one. MAKER2 takes a single protein FASTA, so all evidence proteins (clade proteins + SwissProt) must be merged first. |
First, generate MAKER control files, then edit them to point to your data:
module load maker/3.01.03
module load blast/2.12.0
module load augustus/3.4.0
module load snap
# Generate default control files
maker -CTL
# Creates: maker_opts.ctl, maker_bopts.ctl, maker_exe.ctl
# Edit maker_opts.ctl (key fields):
cat > maker_opts.ctl << 'EOF'
genome=genome_assembly.fasta
organism_type=eukaryotic
# Evidence inputs
est=evidence/transcripts/Hgly_ests.fasta
protein=evidence/proteins/all_proteins.fasta
# No ab initio predictors yet (Round 1)
snaphmm=
augustus_species=
# Repeat masking (use RepeatMasker output or leave blank to run it)
rmlib=
softmask=1 # soft-mask repeats (lowercase) instead of hard-masking
# EST/protein alignment settings
est2genome=1 # infer gene models directly from EST alignments
protein2genome=1 # infer gene models from protein alignments
# Quality filtering
min_contig=1000 # skip contigs shorter than 1kb
EOF
# Run MAKER (MPI parallel on cluster)
mpiexec -n 16 maker -fix_nucleotides 2>&1 | tee maker_round1.log
echo "Round 1 complete. Merging outputs..."
# Merge all output GFF3 and FASTA files
gff3_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
-o round1.gff3
fasta_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
-o round1
ls round1*
| 🛠️maker -CTL | Generates the three control files. Open each in a text editor and fill in paths to your genome, EST, and protein files. maker_exe.ctl must point to the actual installed paths of BLAST, AUGUSTUS, etc. on your cluster. |
| 🧱est2genome=1 | Tells MAKER to generate gene models directly from EST alignments (not just use them as hints). This is correct for Round 1 where we have no trained ab initio predictor yet. |
| 🐦mpiexec -n 16 maker | Run MAKER using MPI (Message Passing Interface) across 16 processes. MAKER parallelizes by splitting the genome into chunks — each MPI process handles a subset of contigs. This is the recommended way to run MAKER on a cluster. |
| 🔀gff3_merge | MAKER writes outputs into a complex datastore directory (one folder per contig). gff3_merge collects all the individual GFF3 files and merges them into a single round1.gff3. Always merge before downstream analysis! |
Extract high-quality gene models from Round 1 to train AUGUSTUS for ab initio prediction:
module load busco/5.4.3
module load augustus/3.4.0
GENOME="genome_assembly.fasta"
SPECIES_NAME="h_glycines" # unique name for your training data
# Run BUSCO in genome mode — it will train Augustus internally
# Use a nematode-appropriate lineage
busco \
-i ${GENOME} \
-o busco_training \
-m genome \
-l nematoda_odb10 \
-c 16 \
--augustus_species caenorhabditis # starting species for training
# Copy trained AUGUSTUS parameters to AUGUSTUS config directory
AUGUSTUS_CONFIG=$(augustus --print-cfg 2>&1 | grep "Config" | cut -d' ' -f3)
cp -r busco_training/run_nematoda_odb10/augustus_output/retraining_parameters \
${AUGUSTUS_CONFIG}/species/${SPECIES_NAME}
echo "Augustus training complete for species: ${SPECIES_NAME}"
echo "Verify:"
ls ${AUGUSTUS_CONFIG}/species/${SPECIES_NAME}/
module load maker/3.01.03
module load snap
# Extract gene models with AED <= 0.25 and protein length >= 50aa
# These are high-confidence models for training
awk '{ if ($0 ~ /\tmaker\t/) print $0 }' round1.gff3 > round1.maker.gff3
maker2zff \
-c 0.25 \
-e 0.25 \
-o 25 \
-d genome_assembly.maker.output/genome_assembly_master_datastore_index.log
# This creates genome.ann and genome.dna files
ls genome.*
# Train SNAP
mkdir -p snap_training
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl ${SPECIES_NAME} params > snap_training/${SPECIES_NAME}.hmm
echo "SNAP HMM trained: snap_training/${SPECIES_NAME}.hmm"
| ✅maker2zff -c 0.25 -e 0.25 | Converts MAKER gene models to ZFF format (SNAP's input). -c and -e set AED cutoffs for coding sequence and exons — only models with AED ≤ 0.25 are used as training data. -o 25 requires proteins ≥ 25 aa. |
| 📦fathom -categorize 1000 | Categorizes gene loci into training categories (single-gene, multi-gene, etc.) and creates windows of 1000 bp flanking sequence around each gene for SNAP to learn from. |
| 🔨forge | Estimates SNAP's model parameters from the categorized training data. Output: a set of parameter files in the params/ directory. |
| 🧠hmm-assembler.pl | Assembles the trained parameter files into a single SNAP HMM file (.hmm). This is what you provide to maker_opts.ctl as snaphmm= in Round 2. |
module load maker/3.01.03
# Update maker_opts.ctl to include trained ab initio predictors
# Key changes from Round 1:
sed -i "s/^snaphmm=.*/snaphmm=snap_training\/h_glycines.hmm/" maker_opts.ctl
sed -i "s/^augustus_species=.*/augustus_species=h_glycines/" maker_opts.ctl
sed -i "s/^est2genome=1/est2genome=0/" maker_opts.ctl # use EST as hints, not direct models
sed -i "s/^protein2genome=1/protein2genome=0/" maker_opts.ctl # same for proteins
# Verify the changes
grep -E "snaphmm|augustus_species|est2genome|protein2genome" maker_opts.ctl
# Run MAKER Round 2
mpiexec -n 16 maker -fix_nucleotides 2>&1 | tee maker_round2.log
# Merge outputs
gff3_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
-o round2.gff3 -n
fasta_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
-o round2
echo "Round 2 gene count:"
grep -c "mRNA" round2.gff3
| ✂️sed -i "s/.../.../ " | In-place (-i) string substitution using sed. Modifies the control file directly without opening a text editor. Adds SNAP and AUGUSTUS paths and switches est/protein from genome-mode (=1) to hints-mode (=0). |
| 🎯est2genome=0 | Switches EST evidence from "direct model" mode to "hints" mode. ESTs now guide the ab initio predictors instead of defining gene boundaries. This gives AUGUSTUS/SNAP more control over the final model structure. |
| 🔍grep -c "mRNA" | Counts the number of mRNA features in the GFF3 = total number of predicted gene models. Compare Round 1 vs Round 2 to see the impact of adding ab initio prediction. |
module load maker/3.01.03
# Extract AED scores from GFF3
AEDplot_lib.pm round2.gff3 > aed_plot.png
# or use the built-in MAKER script:
# maker_AED_CDF_script.pl round2.gff3 > aed.data
# Manual extraction with awk:
grep "mRNA" round2.gff3 | \
awk '{match($9, /AED:([0-9.]+)/, a); print a[1]}' \
> aed_scores.txt
# R: Plot AED distribution
Rscript - << 'EOF'
aed <- read.table("aed_scores.txt")$V1
png("aed_distribution.png", width=800, height=600)
hist(aed, breaks=40, col="steelblue",
main="AED Score Distribution",
xlab="AED Score",
ylab="Number of Gene Models")
abline(v=0.5, col="red", lty=2, lwd=2)
legend("topright", legend="AED=0.5 cutoff", col="red", lty=2)
dev.off()
cat("Gene models with AED < 0.5:", sum(aed < 0.5), "\n")
cat("Total gene models:", length(aed), "\n")
cat("Percentage high-quality:", round(100*mean(aed < 0.5),1), "%\n")
EOF
module load interproscan/5.59-91.0
module load blast/2.12.0
PROTEINS="round2.all.maker.proteins.fasta"
# InterProScan: assign protein domains, GO terms, pathway annotations
interproscan.sh \
-i ${PROTEINS} \
-o interproscan_results.tsv \
-f TSV \
-appl Pfam,TIGRFAM,PRINTS,ProSiteProfiles,Gene3D,SMART \
-goterms \
-pa \
--cpu 16
# BLAST against Swiss-Prot for gene name assignment
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out swissprot_db
blastp \
-query ${PROTEINS} \
-db swissprot_db \
-out blast_swissprot.txt \
-evalue 1e-6 \
-max_target_seqs 5 \
-num_threads 16 \
-outfmt "6 qseqid sseqid pident length evalue bitscore stitle"
echo "Top BLAST hits:"
head -5 blast_swissprot.txt
# Add functional annotations to GFF3
ipr_update_gff round2.gff3 interproscan_results.tsv > round2.annotated.gff3
Check the *_master_datastore_index.log for lines with FAILED status. Re-run those contigs by running MAKER again in the same directory — it resumes from where it stopped. Common cause: BLAST runs out of memory on very large contigs.
The $AUGUSTUS_CONFIG_PATH environment variable must point to a writeable directory containing the trained species folder. Check with echo $AUGUSTUS_CONFIG_PATH and verify the species directory exists there. If running in a Conda environment, the path is usually $CONDA_PREFIX/config/species/.
Check that est2genome and protein2genome were NOT set to 0 simultaneously in round 1 (you need at least one evidence source for model building). Also verify your EST/protein FASTA files are non-empty and in the correct FASTA format.