Genome Annotation with MAKER2

Evidence-based gene prediction on the Heterodera glycines (soybean cyst nematode) genome — MAKER2 pipeline from transcript/protein evidence to trained AUGUSTUS models and AED score evaluation.

~180 min Advanced Bash
Byte

Overview

MAKER2 is an evidence-based genome annotation pipeline that combines three types of evidence to predict gene models:

📚
EST/Transcript evidence

mRNA sequences from your species or close relatives

🧸
Protein homology

Protein sequences from related species (Swiss-Prot, clade proteomes)

🤖
Ab initio prediction

AUGUSTUS + SNAP trained on high-quality gene models

The recommended approach is iterative: Round 1 uses only transcript/protein evidence → extract confident gene models → train AUGUSTUS and SNAP → Round 2 adds ab initio predictions guided by the trained models.

Byte
Organism: Heterodera glycines — the soybean cyst nematode, a major agricultural pathogen causing ~$1 billion in crop losses annually. Its genome (~93 Mb) is complex with many transposons, making annotation challenging. Author: Rick Masonbrink (Iowa State University).

Ideal Directory Structure

MAKER2 is run in two rounds; keeping them in separate numbered directories lets you compare gene model quality between rounds and restart either round independently.

maker2_hglyc_project/
maker2_hglyc_project/ ├── evidence/ │ ├── transcripts/ # EST FASTA files │ └── proteins/ # protein FASTA files (SwissProt + clade) ├── genome/ │ └── Hglyc_genome.fasta # repeat-masked genome FASTA ├── round1_maker/ # MAKER2 round 1 (evidence-only) │ ├── maker_opts.ctl # main config file │ ├── maker_bopts.ctl # BLAST options │ ├── maker_exe.ctl # tool paths │ └── hglyc.all.gff # merged GFF3 output ├── training/ # AUGUSTUS + SNAP training files │ ├── augustus/ # trained AUGUSTUS parameter set │ └── hglyc.hmm # SNAP HMM profile ├── round2_maker/ # MAKER2 round 2 (ab initio + evidence) │ └── hglyc.final.gff # final annotation └── busco/ # BUSCO completeness check └── short_summary.txt
🔄
Why two rounds? Round 1 uses only external evidence (ESTs + proteins) to generate high-confidence "seed" gene models. These seeds are used to train ab initio predictors (AUGUSTUS, SNAP). Round 2 combines all three evidence types for a comprehensive annotation. Skipping to Round 2 without training = low-quality predictions!

Understanding AED Scores

The Annotation Edit Distance (AED) measures how well a gene model is supported by external evidence. It ranges from 0 to 1:

AED ValueInterpretationQuality
0.00 – 0.25Gene model perfectly matches evidenceHigh quality
0.25 – 0.50Minor discrepancies with evidenceAcceptable
0.50 – 0.75Substantial deviation from evidenceLow quality
0.75 – 1.00Gene model contradicts evidenceVery poor

A good annotation should have >80% of gene models with AED < 0.5. The AED score is embedded in the GFF3 output and can be plotted for quality assessment.

AED Score Explanation

AED score explanation graphic from the Bioinformatics Workbook.

Step 1 — Gather Transcriptional and Protein Evidence

Bash — Download clade transcripts and proteins
mkdir -p evidence/{transcripts,proteins}

# Download all publicly available EST and protein sequences for Tylenchida
# Sources: WormBase ParaSite (https://parasite.wormbase.org/ftp.html)
# Species: B. xylophilus, D. destructor, G. pallida, G. rostochiensis,
#          M. floridensis, M. hapla, M. incognita, etc.

# Example download for a few key species:
cd evidence/proteins
wget https://parasite.wormbase.org/ftp.html  # Navigate to download proteins

# Also download SwissProt for broad coverage:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz

# Combine all protein evidence
cat *proteins*.fa uniprot_sprot.fasta > all_proteins.fasta
cd ../..

# Download H. glycines EST sequences from NCBI (species-specific)
cd evidence/transcripts
# efetch -db nuccore -query "Heterodera glycines[Organism] AND EST[filter]" \
#   -format fasta > Hgly_ests.fasta
cd ../..

# Your genome assembly (output from previous assembly step)
GENOME="genome_assembly.fasta"
echo "Genome: $GENOME"
grep -c ">" $GENOME
Download:
Evidence gathering decoded
🚀maker -CTLGenerates the three MAKER2 control files from scratch: maker_opts.ctl (main settings), maker_bopts.ctl (BLAST parameters), and maker_exe.ctl (paths to tools). Edit these before running MAKER2.
🧹RepeatMaskerMasks transposable elements and repetitive sequences in the genome before annotation. MAKER2 runs better with a softmasked genome (lowercase = repeats) because BLAST is not distracted by repetitive regions.
📊cat *.fa > combinedConcatenates all protein evidence files into one. MAKER2 takes a single protein FASTA, so all evidence proteins (clade proteins + SwissProt) must be merged first.

Step 2 — MAKER Round 1 (Evidence Only)

First, generate MAKER control files, then edit them to point to your data:

Bash — MAKER Round 1 Setup and Run
module load maker/3.01.03
module load blast/2.12.0
module load augustus/3.4.0
module load snap

# Generate default control files
maker -CTL
# Creates: maker_opts.ctl, maker_bopts.ctl, maker_exe.ctl

# Edit maker_opts.ctl (key fields):
cat > maker_opts.ctl << 'EOF'
genome=genome_assembly.fasta
organism_type=eukaryotic

# Evidence inputs
est=evidence/transcripts/Hgly_ests.fasta
protein=evidence/proteins/all_proteins.fasta

# No ab initio predictors yet (Round 1)
snaphmm=
augustus_species=

# Repeat masking (use RepeatMasker output or leave blank to run it)
rmlib=
softmask=1   # soft-mask repeats (lowercase) instead of hard-masking

# EST/protein alignment settings
est2genome=1   # infer gene models directly from EST alignments
protein2genome=1  # infer gene models from protein alignments

# Quality filtering
min_contig=1000  # skip contigs shorter than 1kb
EOF

# Run MAKER (MPI parallel on cluster)
mpiexec -n 16 maker -fix_nucleotides 2>&1 | tee maker_round1.log

echo "Round 1 complete. Merging outputs..."
# Merge all output GFF3 and FASTA files
gff3_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
    -o round1.gff3

fasta_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
    -o round1

ls round1*
Download:
MAKER Round 1 commands decoded
🛠️maker -CTLGenerates the three control files. Open each in a text editor and fill in paths to your genome, EST, and protein files. maker_exe.ctl must point to the actual installed paths of BLAST, AUGUSTUS, etc. on your cluster.
🧱est2genome=1Tells MAKER to generate gene models directly from EST alignments (not just use them as hints). This is correct for Round 1 where we have no trained ab initio predictor yet.
🐦mpiexec -n 16 makerRun MAKER using MPI (Message Passing Interface) across 16 processes. MAKER parallelizes by splitting the genome into chunks — each MPI process handles a subset of contigs. This is the recommended way to run MAKER on a cluster.
🔀gff3_mergeMAKER writes outputs into a complex datastore directory (one folder per contig). gff3_merge collects all the individual GFF3 files and merges them into a single round1.gff3. Always merge before downstream analysis!
⏱️
Time expectations: MAKER Round 1 on a ~100 Mb eukaryotic genome with 16 cores typically takes 12–48 hours depending on the size of your EST/protein evidence databases. The BLAST steps are the bottleneck. Plan for an overnight run at minimum.

Step 3 — Train AUGUSTUS Using BUSCO

Extract high-quality gene models from Round 1 to train AUGUSTUS for ab initio prediction:

Bash — BUSCO-based AUGUSTUS training
module load busco/5.4.3
module load augustus/3.4.0

GENOME="genome_assembly.fasta"
SPECIES_NAME="h_glycines"   # unique name for your training data

# Run BUSCO in genome mode — it will train Augustus internally
# Use a nematode-appropriate lineage
busco \
    -i ${GENOME} \
    -o busco_training \
    -m genome \
    -l nematoda_odb10 \
    -c 16 \
    --augustus_species caenorhabditis  # starting species for training

# Copy trained AUGUSTUS parameters to AUGUSTUS config directory
AUGUSTUS_CONFIG=$(augustus --print-cfg 2>&1 | grep "Config" | cut -d' ' -f3)
cp -r busco_training/run_nematoda_odb10/augustus_output/retraining_parameters \
    ${AUGUSTUS_CONFIG}/species/${SPECIES_NAME}

echo "Augustus training complete for species: ${SPECIES_NAME}"
echo "Verify:"
ls ${AUGUSTUS_CONFIG}/species/${SPECIES_NAME}/
Download:

Step 4 — Train SNAP from Round 1 Annotations

Bash — SNAP training from MAKER Round 1 output
module load maker/3.01.03
module load snap

# Extract gene models with AED <= 0.25 and protein length >= 50aa
# These are high-confidence models for training
awk '{ if ($0 ~  /\tmaker\t/) print $0 }' round1.gff3 > round1.maker.gff3

maker2zff \
    -c 0.25 \
    -e 0.25 \
    -o 25 \
    -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log

# This creates genome.ann and genome.dna files
ls genome.*

# Train SNAP
mkdir -p snap_training
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl ${SPECIES_NAME} params > snap_training/${SPECIES_NAME}.hmm

echo "SNAP HMM trained: snap_training/${SPECIES_NAME}.hmm"
Download:
SNAP training pipeline decoded
maker2zff -c 0.25 -e 0.25Converts MAKER gene models to ZFF format (SNAP's input). -c and -e set AED cutoffs for coding sequence and exons — only models with AED ≤ 0.25 are used as training data. -o 25 requires proteins ≥ 25 aa.
📦fathom -categorize 1000Categorizes gene loci into training categories (single-gene, multi-gene, etc.) and creates windows of 1000 bp flanking sequence around each gene for SNAP to learn from.
🔨forgeEstimates SNAP's model parameters from the categorized training data. Output: a set of parameter files in the params/ directory.
🧠hmm-assembler.plAssembles the trained parameter files into a single SNAP HMM file (.hmm). This is what you provide to maker_opts.ctl as snaphmm= in Round 2.

Step 5 — MAKER Round 2 (With ab initio Predictors)

Bash — MAKER Round 2 with AUGUSTUS + SNAP
module load maker/3.01.03

# Update maker_opts.ctl to include trained ab initio predictors
# Key changes from Round 1:
sed -i "s/^snaphmm=.*/snaphmm=snap_training\/h_glycines.hmm/" maker_opts.ctl
sed -i "s/^augustus_species=.*/augustus_species=h_glycines/" maker_opts.ctl
sed -i "s/^est2genome=1/est2genome=0/" maker_opts.ctl      # use EST as hints, not direct models
sed -i "s/^protein2genome=1/protein2genome=0/" maker_opts.ctl  # same for proteins

# Verify the changes
grep -E "snaphmm|augustus_species|est2genome|protein2genome" maker_opts.ctl

# Run MAKER Round 2
mpiexec -n 16 maker -fix_nucleotides 2>&1 | tee maker_round2.log

# Merge outputs
gff3_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
    -o round2.gff3 -n

fasta_merge -d genome_assembly.maker.output/genome_assembly_master_datastore_index.log \
    -o round2

echo "Round 2 gene count:"
grep -c "mRNA" round2.gff3
Download:
Round 2 key changes decoded
✂️sed -i "s/.../.../ "In-place (-i) string substitution using sed. Modifies the control file directly without opening a text editor. Adds SNAP and AUGUSTUS paths and switches est/protein from genome-mode (=1) to hints-mode (=0).
🎯est2genome=0Switches EST evidence from "direct model" mode to "hints" mode. ESTs now guide the ab initio predictors instead of defining gene boundaries. This gives AUGUSTUS/SNAP more control over the final model structure.
🔍grep -c "mRNA"Counts the number of mRNA features in the GFF3 = total number of predicted gene models. Compare Round 1 vs Round 2 to see the impact of adding ab initio prediction.
🏆
Round 2 gene count: For H. glycines, Round 1 typically predicts ~18,000 gene models, while Round 2 predicts ~21,000+ models. The ab initio predictors find genes missed by EST/protein evidence alone — especially novel genes with no homologs!

Step 6 — Evaluate Gene Models with AED Plot

Bash + R — AED distribution plot
module load maker/3.01.03

# Extract AED scores from GFF3
AEDplot_lib.pm round2.gff3 > aed_plot.png
# or use the built-in MAKER script:
# maker_AED_CDF_script.pl round2.gff3 > aed.data

# Manual extraction with awk:
grep "mRNA" round2.gff3 | \
    awk '{match($9, /AED:([0-9.]+)/, a); print a[1]}' \
    > aed_scores.txt

# R: Plot AED distribution
Rscript - << 'EOF'
aed <- read.table("aed_scores.txt")$V1
png("aed_distribution.png", width=800, height=600)
hist(aed, breaks=40, col="steelblue",
     main="AED Score Distribution",
     xlab="AED Score",
     ylab="Number of Gene Models")
abline(v=0.5, col="red", lty=2, lwd=2)
legend("topright", legend="AED=0.5 cutoff", col="red", lty=2)
dev.off()
cat("Gene models with AED < 0.5:", sum(aed < 0.5), "\n")
cat("Total gene models:", length(aed), "\n")
cat("Percentage high-quality:", round(100*mean(aed < 0.5),1), "%\n")
EOF
Download:
Rule of thumb for MAKER annotations: A well-annotated genome should have >80% of gene models with AED < 0.5. BUSCO completeness of the predicted protein set should be >85% for a high-quality annotation. For H. glycines, the published annotation contains ~21,000 gene models.

Step 7 — Functional Annotation

Bash — InterProScan + BLAST functional annotation
module load interproscan/5.59-91.0
module load blast/2.12.0

PROTEINS="round2.all.maker.proteins.fasta"

# InterProScan: assign protein domains, GO terms, pathway annotations
interproscan.sh \
    -i ${PROTEINS} \
    -o interproscan_results.tsv \
    -f TSV \
    -appl Pfam,TIGRFAM,PRINTS,ProSiteProfiles,Gene3D,SMART \
    -goterms \
    -pa \
    --cpu 16

# BLAST against Swiss-Prot for gene name assignment
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out swissprot_db
blastp \
    -query ${PROTEINS} \
    -db swissprot_db \
    -out blast_swissprot.txt \
    -evalue 1e-6 \
    -max_target_seqs 5 \
    -num_threads 16 \
    -outfmt "6 qseqid sseqid pident length evalue bitscore stitle"

echo "Top BLAST hits:"
head -5 blast_swissprot.txt

# Add functional annotations to GFF3
ipr_update_gff round2.gff3 interproscan_results.tsv > round2.annotated.gff3
Download:

Troubleshooting

MAKER: job stalls or some contigs never complete

Check the *_master_datastore_index.log for lines with FAILED status. Re-run those contigs by running MAKER again in the same directory — it resumes from where it stopped. Common cause: BLAST runs out of memory on very large contigs.

AUGUSTUS fails: "species not found in config directory"

The $AUGUSTUS_CONFIG_PATH environment variable must point to a writeable directory containing the trained species folder. Check with echo $AUGUSTUS_CONFIG_PATH and verify the species directory exists there. If running in a Conda environment, the path is usually $CONDA_PREFIX/config/species/.

Very few gene models after Round 2 (<5,000 for a ~90 Mb genome)

Check that est2genome and protein2genome were NOT set to 0 simultaneously in round 1 (you need at least one evidence source for model building). Also verify your EST/protein FASTA files are non-empty and in the correct FASTA format.