Genome Annotation Pipeline

Repeat masking, gene prediction, evidence-based annotation with MAKER & BRAKER2, and functional annotation with InterProScan & eggNOG-mapper.

~90 min | Advanced | Bash / Perl

1 Overview

Genome annotation identifies genes, regulatory elements, and repeats within an assembled genome. The standard workflow:

Annotation Pipeline Order
  1. Repeat identification & masking — RepeatModeler + RepeatMasker
  2. Ab initio gene prediction — BRAKER2/3 (trains Augustus/GeneMark)
  3. Evidence-based annotation — MAKER (combines ab initio + RNA-Seq + protein evidence)
  4. Functional annotation — InterProScan, eggNOG-mapper, BLAST
  5. Manual curation — Apollo/JBrowse for key genes
Thumb Rule: Always mask repeats before gene prediction. Unmasked repeats cause thousands of false gene predictions.

2 Repeat Identification & Masking

Bash
# Step 1: Build a custom repeat library
BuildDatabase -name mygenome assembly.fasta
RepeatModeler -database mygenome -pa 16 -LTRStruct

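# Step 2: Mask the genome (-xsmall soft-masks: repeats become lowercase, not Ns)
RepeatMasker -pa 16 -gff -xsmall -lib mygenome-families.fa assembly.fasta

# Copy the masked genome to the file name used by the later steps
cp assembly.fasta.masked assembly.masked.fasta

# Output files:
# assembly.fasta.masked     - soft-masked genome (lowercase repeats, because of -xsmall)
# assembly.fasta.out        - detailed repeat table
# assembly.fasta.out.gff    - repeat coordinates
# assembly.fasta.tbl        - summary statistics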
RepeatMasker Summary (typical plant genome)
==================================================
file name:     assembly.fasta
sequences:     12
total length:  498234102 bp
GC level:      36.42 %
bases masked:  349765270 bp ( 70.20 %)
==================================================
                 number of       length   percentage
                 elements          (bp)  of sequence
--------------------------------------------------
SINEs:                234        45,678       0.01 %
LINEs:             12,456     8,234,567       1.65 %
LTR elements:      98,765   234,567,890      47.08 %
DNA elements:      23,456    45,678,901       9.17 %
Unclassified:      45,678    56,789,012      11.40 %
Total repeats:              349,765,270      70.20 %
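As a quick cross-check of the "bases masked" figure, the soft-masked FASTA can be inspected directly. The sketch below is one possible way to do it, assuming the genome was masked with -xsmall so that repeats are lowercase; the function name masked_fraction is hypothetical and not part of RepeatMasker.

```bash
# masked_fraction FASTA
# Prints: total_bases soft_masked_bases percent_masked
# Assumption: repeats are soft-masked (lowercase a/c/g/t/n), as produced by -xsmall.
masked_fraction() {
  awk '
    /^>/ { next }                             # skip FASTA header lines
    {
      total += length($0)
      line = $0
      masked += gsub(/[acgtn]/, "", line)     # gsub returns the number of replacements
    }
    END {
      pct = (total > 0) ? 100 * masked / total : 0
      printf "%d %d %.2f\n", total, masked, pct
    }
  ' "$1"
}
```

Calling masked_fraction assembly.fasta.masked should give a percentage close to the one in the .tbl summary; a value near zero suggests the file was hard-masked or not masked at all.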
Mnemonic

"Mask Before Predicting" — MBP

Always Mask repeats → Build training set → Predict genes. Skipping masking = thousands of false positives.

3 BRAKER2/3 — Ab Initio Gene Prediction

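BRAKER automatically trains Augustus and GeneMark using RNA-Seq and/or protein evidence: GeneMark-ET with RNA-Seq alone (BRAKER1), GeneMark-EP+ with proteins alone (BRAKER2), and GeneMark-ETP when both are supplied (BRAKER3).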

Bash
# BRAKER with RNA-Seq evidence (preferred)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --softmasking \
  --cores 16 \
  --species=my_species

# BRAKER with protein evidence only
braker.pl --genome=assembly.masked.fasta \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species

# BRAKER3: combines RNA-Seq + proteins (best)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species
Evidence Type        BRAKER Mode     Expected Gene Count Accuracy
RNA-Seq only         BRAKER1 mode    Good for expressed genes, misses tissue-specific
Protein only         BRAKER2 mode    Good coverage, may miss novel genes
RNA-Seq + Protein    BRAKER3 mode    Best overall; use when both available
Byte tip
Protein databases for BRAKER: Download the OrthoDB protein set for your clade (e.g., Viridiplantae for plants, Metazoa for animals). Pre-partitioned sets are linked from the BRAKER documentation.
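The choice between the three modes comes down to which evidence files you actually have. The sketch below shows one way to build the evidence flags automatically; braker_evidence_args is a hypothetical helper, and the file names are the placeholders used in the commands above.

```bash
# braker_evidence_args BAM PROTEINS
# Prints the evidence flags for braker.pl, one per line, for whichever
# input files exist and are non-empty. Fails if neither is available.
braker_evidence_args() {
  local bam=$1 prot=$2 found=0
  if [ -s "$bam" ]; then
    echo "--bam=$bam"
    found=1
  fi
  if [ -s "$prot" ]; then
    echo "--prot_seq=$prot"
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "no evidence found: need an RNA-Seq BAM and/or a protein FASTA" >&2
    return 1
  fi
}

# Example use (not executed here):
#   mapfile -t evidence < <(braker_evidence_args rnaseq_aligned.bam proteins.fasta)
#   braker.pl --genome=assembly.masked.fasta "${evidence[@]}" \
#     --softmasking --cores 16 --species=my_species
```

With only a BAM present this reproduces the RNA-Seq-only call, with only proteins the protein-only call, and with both the combined call.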

4 MAKER — Evidence-Based Annotation

MAKER combines ab initio predictions, RNA-Seq alignments, and protein homology into consensus gene models.

Bash
# Generate control files
maker -CTL
# Edit: maker_opts.ctl, maker_bopts.ctl, maker_exe.ctl

# Key settings in maker_opts.ctl:
# genome=assembly.masked.fasta
# est=transcripts.fasta          (assembled RNA-Seq)
# protein=uniprot_proteins.fasta (related species)
# rmlib=mygenome-families.fa     (custom repeat library)
# augustus_species=my_species     (trained by BRAKER)
# est2genome=1                   (for first pass)
# protein2genome=1               (for first pass)

# Run MAKER (MPI for parallel)
mpiexec -n 16 maker -base my_annotation maker_opts.ctl maker_bopts.ctl maker_exe.ctl

# Extract results
cd my_annotation.maker.output
gff3_merge -d my_annotation_master_datastore_index.log
fasta_merge -d my_annotation_master_datastore_index.log

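# Filter by AED score (Annotation Edit Distance); AED < 0.5 = good gene models
# Needs gawk (3-argument match). Only mRNA lines carry _AED=, so the output
# holds mRNA records; use their IDs to pull gene/exon/CDS features if needed.
gawk -F'\t' '$3 == "mRNA" && match($9, /_AED=([0-9.]+)/, a) && a[1] + 0 < 0.5' \
  my_annotation.all.gff > high_quality_genes.gff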
Thumb Rule — AED Scores: AED = 0 means perfect match to evidence. AED = 1 means no evidence. Filter at AED < 0.5 for publication. AED < 0.3 for high-confidence set.
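Before settling on a cutoff, it helps to see how many transcripts fall under each candidate threshold. The following sketch tabulates that from a MAKER GFF; it assumes AED is stored as _AED= in column 9 of mRNA features, and aed_summary is a hypothetical helper name.

```bash
# aed_summary GFF [cutoff ...]
# For each cutoff prints: cutoff n_below n_total percent_below (mRNA features only).
# Default cutoffs are illustrative: 0.25 0.5 0.75
aed_summary() {
  local gff=$1
  shift
  local cutoffs="0.25 0.5 0.75"
  if [ "$#" -gt 0 ]; then
    cutoffs="$*"
  fi
  awk -F'\t' -v cutoffs="$cutoffs" '
    $3 == "mRNA" && match($9, /_AED=[0-9.]+/) {
      # RSTART points at "_AED="; skip those 5 characters to reach the number
      aed[n++] = substr($9, RSTART + 5, RLENGTH - 5) + 0
    }
    END {
      k = split(cutoffs, c, " ")
      for (i = 1; i <= k; i++) {
        below = 0
        for (j = 0; j < n; j++) if (aed[j] < c[i] + 0) below++
        printf "%s %d %d %.1f\n", c[i], below, n, (n ? 100 * below / n : 0)
      }
    }
  ' "$gff"
}
```

For example, aed_summary my_annotation.all.gff 0.3 0.5 prints one line per cutoff, which makes it easy to judge how much a stricter filter would cost.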

5 Functional Annotation

Bash
# InterProScan: domain and GO annotation
# -b sets the output base name; -o only works with a single output format
interproscan.sh -i proteins.fasta \
  -f tsv,gff3,xml \
  -goterms -iprlookup -pa \
  -cpu 16 \
  -b interproscan_results

# eggNOG-mapper — ortholog-based functional annotation
emapper.py -i proteins.fasta \
  --output eggnog_results \
  -m diamond \
  --cpu 16 \
  --go_evidence non-electronic

# BLAST against SwissProt/UniProt
blastp -query proteins.fasta \
  -db swissprot \
  -evalue 1e-5 \
  -max_target_seqs 5 \
  -outfmt 6 \
  -num_threads 16 \
  -out blast_swissprot.txt
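A simple sanity check after functional annotation is the share of proteins that received at least one hit. The sketch below computes it from the InterProScan TSV, assuming the protein ID is in column 1 and that every ID in the TSV also occurs in the FASTA; annotated_fraction is a hypothetical helper.

```bash
# annotated_fraction PROTEINS_FASTA INTERPROSCAN_TSV
# Prints: total_proteins proteins_with_hit percent_with_hit
annotated_fraction() {
  local total hits
  total=$(grep -c '^>' "$1" || true)                   # number of FASTA records
  hits=$(cut -f1 "$2" | sort -u | grep -c . || true)   # unique protein IDs with a hit
  awk -v t="$total" -v h="$hits" \
    'BEGIN { printf "%d %d %.1f\n", t, h, (t ? 100 * h / t : 0) }'
}
```

Run as annotated_fraction proteins.fasta interproscan_results.tsv. A very low percentage usually points at an input problem (for example, nucleotide sequences passed as proteins) and not at the biology.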

6 Methylation Calling (Bonus)

PacBio HiFi and ONT reads can carry base-modification information (MM/ML tags in the BAM) when modification calling was enabled upstream. Extract 5mC methylation from the aligned reads:

Bash
# PacBio: pb-CpG-tools
aligned_bam_to_cpg_scores --bam aligned.hifi.bam \
  --output-prefix cpg_scores \
  --model pileup_calling_model.v1.tflite \
  --threads 16

# ONT: modkit (from Oxford Nanopore)
modkit pileup aligned.ont.bam methylation.bed \
  --ref assembly.fasta \
  --threads 16
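Once modkit has written its bedMethyl file, a genome-wide methylation level is a useful first number to look at. This is a minimal sketch that assumes the modkit bedMethyl layout (column 4 = modification code, column 10 = valid coverage, column 12 = modified calls); check the layout against your modkit version. The helper name mean_methylation and the default coverage floor of 5 are illustrative choices.

```bash
# mean_methylation BEDMETHYL [min_cov]
# Prints: n_sites coverage_weighted_percent_5mC
# Assumed columns: $4 = mod code ("m" for 5mC), $10 = valid coverage, $12 = modified calls.
mean_methylation() {
  awk -v min="${2:-5}" '
    $4 ~ /^m/ && $10 >= min { valid += $10; mod += $12; sites++ }
    END {
      if (valid > 0) printf "%d %.2f\n", sites, 100 * mod / valid
      else print "0 NA"
    }
  ' "$1"
}
```

Weighting by coverage (total modified calls over total valid calls) keeps a handful of low-depth sites from skewing the estimate.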

7 AI Prompt Guide

Annotation Pipeline for Your Genome
I have a [ORGANISM] genome assembly ([SIZE] Mb, [CONTIGS] contigs, BUSCO C=[X]%). I have [RNA-Seq BAM / assembled transcripts / neither] as evidence. I have [protein sequences from related species / OrthoDB / neither]. Please write a complete annotation pipeline that:
  1. Masks repeats with RepeatModeler + RepeatMasker
  2. Runs BRAKER with my available evidence
  3. Runs MAKER to refine gene models
  4. Performs functional annotation with InterProScan + eggNOG-mapper
  5. Filters final genes by AED < 0.5
  6. Outputs gene count and average gene length statistics
  7. Wraps everything in SLURM scripts for [CORES] CPUs, [RAM] GB RAM

8 Common Errors & Troubleshooting

MAKER: "Died at..." or hangs

Check maker_opts.ctl paths — MAKER silently fails on bad file paths. Use absolute paths. Also check that Augustus species is trained (augustus --species=help).
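Because bad paths fail quietly, it can save time to verify them before launching MAKER. The sketch below checks the file-valued keys shown earlier (genome, est, protein, rmlib); check_maker_paths is a hypothetical helper, and it assumes key=value lines with optional trailing #comments, comma-separated file lists, and paths without spaces.

```bash
# check_maker_paths CTL
# Prints "MISSING key=path" for each listed file that does not exist.
# Returns non-zero if any file is missing.
check_maker_paths() {
  local ctl=$1 key value path missing=0 paths
  for key in genome est protein rmlib; do
    # Text after "key=", without any trailing "#comment" or whitespace
    value=$(sed -n "s/^${key}=\([^#]*\).*/\1/p" "$ctl" | head -n 1 | tr -d '[:space:]')
    if [ -z "$value" ]; then
      continue                       # key left empty: nothing to check
    fi
    IFS=',' read -r -a paths <<< "$value"
    for path in "${paths[@]}"; do
      if [ ! -f "$path" ]; then
        echo "MISSING ${key}=${path}"
        missing=1
      fi
    done
  done
  return "$missing"
}
```

Running check_maker_paths maker_opts.ctl before mpiexec gives an explicit list of broken paths in place of a silent failure.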

BRAKER: "No training genes found"

Your RNA-Seq BAM may have too few aligned reads. Check alignment rate with samtools flagstat. Ensure BAM is sorted and indexed. Try protein mode instead.
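The alignment-rate check can be scripted by parsing the text that samtools flagstat prints. This is a small sketch that reads that text on stdin, so samtools itself is not called inside the function; mapped_percent is a hypothetical helper name.

```bash
# mapped_percent
# Reads `samtools flagstat` output on stdin and prints the overall mapped percentage.
# Prints nothing if the rate is reported as N/A (no reads).
mapped_percent() {
  # Match "N + M mapped (P% ..." only; "primary mapped" lines do not fit this pattern
  sed -n 's/^[0-9]* + [0-9]* mapped (\([0-9.]*\)%.*/\1/p' | head -n 1
}

# Example use (not executed here):
#   samtools flagstat rnaseq_aligned.bam | mapped_percent
```

What counts as "too few" depends on the library and the aligner, so treat any threshold you apply to this number as a project-specific choice.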

RepeatModeler: Takes days to finish

Normal for large genomes (>1 Gb). Raise the thread count (-pa 32; recent RepeatModeler releases use -threads instead). For a quick alternative, run RepeatMasker with a curated library only and skip RepeatModeler.

Too many predicted genes (>2× expected)

Likely unmasked repeats or transposon fragments being called as genes. Re-run repeat masking more thoroughly, or filter by AED and protein evidence.

9 Organism-Specific Notes

Plants
  • Large plant genomes are often 70–85% repetitive (e.g., maize, wheat), so repeat masking is critical
  • Polyploid genomes: expect 2×–6× gene count vs. diploid relatives
  • Use Viridiplantae OrthoDB proteins for BRAKER
  • Alternative: Helixer (deep-learning gene predictor, good for plants)
Animals
  • More introns and longer genes than plants — gene finders need proper training
  • Use Metazoa OrthoDB proteins
  • For submission-ready annotation, consider NCBI's eukaryotic pipeline (EGAPx); PGAP covers prokaryotes only
Microbes
  • Use Prokka or Bakta — much simpler than MAKER/BRAKER
  • Prokaryotes: no introns, high gene density, minimal repeats
  • Fungi: use Funannotate — designed specifically for fungi

10 All Annotation Tools

Tool            Description                               Category
MAKER           Evidence-based pipeline (eukaryotes)      Eukaryote
BRAKER2/3       Self-training Augustus+GeneMark           Eukaryote
Helixer         Deep learning gene predictor (GPU)        DL
GALBA           Miniprot + Augustus (protein only)        Protein
Funannotate     Fungi-specific annotation pipeline        Fungi
Prokka          Fast prokaryotic annotation               Bacteria
Bakta           Modern Prokka replacement, better DB      Bacteria
PGAP            NCBI's prokaryotic annotation pipeline    Submission
GeMoMa          Homology-based gene model transfer        Homology
TSEBRA          Combine BRAKER + other predictions        Combiner
InterProScan    Domain, GO, pathway annotation            Functional
eggNOG-mapper   Ortholog-based functional annotation      Functional

Mnemonics & Thumb Rules

Mnemonic

"Repeat → Train → Annotate → Function" — RTAF

RepeatMasker → Train with BRAKER → Annotate with MAKER → Functional with InterProScan. This order is mandatory.

Thumb Rule

Expected gene counts

Plants: 25,000–40,000 genes. Mammals: 20,000–25,000. Insects: 13,000–18,000. Bacteria: 3,000–6,000. If you get roughly 2× these numbers in a diploid genome, suspect incomplete repeat masking first.
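Checking a finished annotation against these ranges takes one pass over the GFF3. The sketch below counts gene features, reports the mean gene length, and flags counts above twice the upper bound; gene_stats and its verdict labels are hypothetical, and the expected range is passed in by the caller.

```bash
# gene_stats GFF3 MIN_EXPECTED MAX_EXPECTED
# Prints: gene_count mean_gene_length_bp verdict
gene_stats() {
  awk -F'\t' -v lo="$2" -v hi="$3" '
    $3 == "gene" { n++; len += $5 - $4 + 1 }    # GFF3 coordinates are 1-based, inclusive
    END {
      verdict = "within_range"
      if (n < lo) verdict = "below_range"
      else if (n > 2 * hi) verdict = "check_repeat_masking"
      else if (n > hi) verdict = "above_range"
      printf "%d %.0f %s\n", n, (n ? len / n : 0), verdict
    }
  ' "$1"
}
```

For a plant genome, gene_stats my_annotation.all.gff 25000 40000 would apply the range quoted above. Polyploids can legitimately exceed it, so treat the verdict as a prompt to investigate.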

Thumb Rule — AED Filter: Start with AED < 0.5 (retains roughly 80% of real genes). For a high-confidence subset, tighten to AED < 0.3, or 0.25 for a stricter set. Before submission, manually curate any gene you want to keep whose AED is above 0.7; do not accept it as-is.

Summary

You can now:
  1. Build custom repeat libraries and mask genomes
  2. Train gene predictors with BRAKER using RNA-Seq or protein evidence
  3. Run MAKER for evidence-based consensus annotation
  4. Add functional annotations with InterProScan and eggNOG-mapper
  5. Filter and evaluate gene models by AED score