Genome Annotation — MAKER, BRAKER2, Functional Annotation & TE Masking

1 Overview

Genome annotation identifies genes, regulatory elements, and repeats within an assembled genome. The standard workflow:

Annotation Pipeline Order

1. Repeat identification and masking: RepeatModeler + RepeatMasker
2. Ab initio gene prediction: BRAKER2/3 (trains Augustus/GeneMark)
3. Evidence-based annotation: MAKER (combines ab initio + RNA-Seq + protein evidence)
4. Functional annotation: InterProScan, eggNOG-mapper, BLAST
5. Manual curation: Apollo/JBrowse for key genes

Thumb Rule: Always mask repeats before gene prediction. Unmasked repeats cause thousands of false gene predictions.

2 Repeat Identification & Masking

Bash

# Step 1: Build a custom repeat library
BuildDatabase -name mygenome assembly.fasta
RepeatModeler -database mygenome -pa 16 -LTRStruct

# Step 2: Soft-mask the genome (-xsmall writes repeats in lowercase instead of Ns)
RepeatMasker -pa 16 -gff -xsmall -lib mygenome-families.fa assembly.fasta

# Output files:
# assembly.fasta.masked     soft-masked sequence (lowercase repeats, because of -xsmall)
# assembly.fasta.out        detailed repeat table
# assembly.fasta.out.gff    repeat coordinates
# assembly.fasta.tbl        summary statistics

# Later steps refer to the masked genome as assembly.masked.fasta
cp assembly.fasta.masked assembly.masked.fasta


RepeatMasker Summary (typical plant genome)
==================================================
file name:    assembly.fasta
sequences:    12
total length: 498234102 bp
GC level:     36.42 %
bases masked: 349765270 bp ( 70.20 %)
==================================================
                number of      length   percentage
                elements         (bp)  of sequence
--------------------------------------------------
SINEs:               234       45,678       0.01 %
LINEs:            12,456    8,234,567       1.65 %
LTR elements:     98,765  234,567,890      47.08 %
DNA elements:     23,456   45,678,901       9.17 %
Unclassified:     45,678   56,789,012      11.40 %
Total repeats:            349,765,270      70.20 %
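As a quick sanity check on the masked FASTA itself, the lowercase fraction can be counted directly and compared with the "bases masked" figure in the summary. The following is a minimal sketch, assuming the genome was soft-masked with -xsmall; `masked_fraction` is a hypothetical helper name, not part of RepeatMasker, and the demo sequence is made up.

```shell
#!/usr/bin/env bash
# Sketch: measure how much of a soft-masked FASTA is lowercase (masked).
# masked_fraction is a hypothetical helper, not a RepeatMasker command.
masked_fraction() {
  LC_ALL=C awk '
    /^>/ { next }                       # skip FASTA header lines
    {
      total  += length($0)              # count every base on the line
      masked += gsub(/[a-z]/, "")       # gsub returns how many lowercase bases it removed
    }
    END {
      if (total == 0) { print "0 0 0.00"; exit }
      printf "%d %d %.2f\n", masked, total, 100 * masked / total
    }
  ' "$1"
}

# Demo on a tiny made-up sequence: prints masked bases, total bases, percent masked.
demo_fasta=$(mktemp)
printf '>chr1\nACGTacgtAC\nGGnnNNacgt\n' > "$demo_fasta"
masked_fraction "$demo_fasta"
rm -f "$demo_fasta"
```

On a real run the call would be `masked_fraction assembly.fasta.masked`. A percentage far below the summary table usually means the file was hard-masked or not masked at all.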

Mnemonic

"Mask Before Predicting" — MBP

Always Mask repeats → Build training set → Predict genes. Skipping masking = thousands of false positives.

3 BRAKER2/3 — Ab Initio Gene Prediction

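BRAKER automatically trains Augustus together with the GeneMark variant that matches the available evidence: GeneMark-ET with RNA-Seq, GeneMark-EP+ with proteins, and GeneMark-ETP when both are supplied.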

Bash

# BRAKER with RNA-Seq evidence (preferred)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --softmasking \
  --cores 16 \
  --species=my_species

# BRAKER with protein evidence only
braker.pl --genome=assembly.masked.fasta \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species

# BRAKER3: combines RNA-Seq + proteins (best)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species


Evidence Type	BRAKER Mode	Strengths and Limitations
RNA-Seq only	BRAKER1 mode	Good for expressed genes; misses genes not expressed in the sampled tissues
Protein only	BRAKER2 mode	Good coverage; may miss novel genes with no homolog in the protein set
RNA-Seq + Protein	BRAKER3 mode	Best overall; use when both are available

Protein databases for BRAKER: Download the OrthoDB proteins for your clade, e.g., Viridiplantae for plants or Metazoa for animals. They are available from the BRAKER datasets page.
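The choice between the three modes in the table comes down to which evidence files are actually present. A minimal sketch of that decision follows; `braker_evidence_args` is a hypothetical helper, and the file names are simply the ones used in the BRAKER commands above. It prints the assembled command instead of running it, so the result can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn the evidence files that actually exist into braker.pl arguments.
braker_evidence_args() {
  bam=$1
  prot=$2
  args=""
  if [ -n "$bam" ] && [ -s "$bam" ]; then
    args="--bam=$bam"                         # RNA-Seq alignments are available
  fi
  if [ -n "$prot" ] && [ -s "$prot" ]; then
    args="${args:+$args }--prot_seq=$prot"    # protein evidence is available
  fi
  if [ -z "$args" ]; then
    echo "no usable RNA-Seq BAM or protein FASTA found" >&2
    return 1
  fi
  printf '%s\n' "$args"
}

# Example: print the command rather than running it.
if evidence=$(braker_evidence_args rnaseq_aligned.bam proteins.fasta); then
  echo "braker.pl --genome=assembly.masked.fasta $evidence --softmasking --cores 16 --species=my_species"
fi
```

Checking with `-s` (exists and is non-empty) rather than `-e` catches the common case of a zero-byte BAM left behind by a failed alignment job.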

4 MAKER — Evidence-Based Annotation

MAKER combines ab initio predictions, RNA-Seq alignments, and protein homology into consensus gene models.

Bash

# Generate control files
maker -CTL
# Edit: maker_opts.ctl, maker_bopts.ctl, maker_exe.ctl

# Key settings in maker_opts.ctl:
# genome=assembly.masked.fasta
# est=transcripts.fasta          (assembled RNA-Seq)
# protein=uniprot_proteins.fasta (related species)
# rmlib=mygenome-families.fa     (custom repeat library)
# augustus_species=my_species     (trained by BRAKER)
# est2genome=1                   (for first pass)
# protein2genome=1               (for first pass)

# Run MAKER (MPI for parallel)
mpiexec -n 16 maker -base my_annotation maker_opts.ctl maker_bopts.ctl maker_exe.ctl

# Extract results
cd my_annotation.maker.output
gff3_merge -d my_annotation_master_datastore_index.log
fasta_merge -d my_annotation_master_datastore_index.log

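# Filter by AED score (Annotation Edit Distance)
# AED < 0.5 = good gene models
# Note: the three-argument match() is a gawk extension, so call gawk explicitly.
# Only lines that carry an AED attribute (the mRNA features) are kept.
gawk 'match($0, /_AED=([0-9.]+)/, a) && a[1] + 0 < 0.5' \
  my_annotation.all.gff > high_quality_genes.gff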


Thumb Rule (AED scores): AED = 0 means the gene model agrees perfectly with the evidence; AED = 1 means it has no supporting evidence. Filter at AED < 0.5 for publication, and tighten to AED < 0.3 or lower for a high-confidence set.
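Before settling on a cutoff, it helps to see how many transcripts each threshold would keep. The sketch below uses portable awk and assumes the AED is stored as `_AED=<value>` in column 9 of the mRNA lines of the merged MAKER GFF; `aed_summary` is a hypothetical helper and the demo records are invented.

```shell
#!/usr/bin/env bash
# Sketch: count how many transcripts pass an AED cutoff (default 0.5).
# aed_summary is a hypothetical helper, not a MAKER script.
aed_summary() {
  awk -F'\t' -v cutoff="${2:-0.5}" '
    $3 == "mRNA" {
      n = split($9, attrs, ";")                 # column 9 holds key=value attributes
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^_AED=/) {              # _eAED= is deliberately not matched
          total++
          if (substr(attrs[i], 6) + 0 < cutoff + 0) kept++
        }
      }
    }
    END {
      pct = (total > 0) ? 100 * kept / total : 0
      printf "%d of %d transcripts have AED < %s (%.1f%%)\n", kept, total, cutoff, pct
    }
  ' "$1"
}

# Demo with four made-up transcripts.
demo_gff=$(mktemp)
{
  printf 'chr1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1;_AED=0.00;_eAED=0.00\n'
  printf 'chr1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=t2;Parent=g2;_AED=0.12;_eAED=0.15\n'
  printf 'chr1\tmaker\tmRNA\t4000\t4900\t.\t-\t.\tID=t3;Parent=g3;_AED=0.45;_eAED=0.50\n'
  printf 'chr1\tmaker\tmRNA\t6000\t6900\t.\t-\t.\tID=t4;Parent=g4;_AED=0.80;_eAED=0.90\n'
} > "$demo_gff"
aed_summary "$demo_gff" 0.5
aed_summary "$demo_gff" 0.3
rm -f "$demo_gff"
```

Running it at two or three cutoffs shows how sharply the gene set shrinks as the threshold tightens, which is a more defensible basis for choosing a filter than a fixed number.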

5 Functional Annotation

Bash

# InterProScan: domain and GO annotation
# -b sets the output base name (one file per format); -o accepts only a single format
interproscan.sh -i proteins.fasta \
  -f tsv,gff3,xml \
  -goterms -iprlookup -pa \
  -cpu 16 \
  -b interproscan_results

# eggNOG-mapper — ortholog-based functional annotation
emapper.py -i proteins.fasta \
  --output eggnog_results \
  -m diamond \
  --cpu 16 \
  --go_evidence non-electronic

# BLAST against SwissProt/UniProt
blastp -query proteins.fasta \
  -db swissprot \
  -evalue 1e-5 \
  -max_target_seqs 5 \
  -outfmt 6 \
  -num_threads 16 \
  -out blast_swissprot.txt
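With -max_target_seqs 5 each protein can have up to five rows in the tabular output, while a gene description usually needs just one. A minimal sketch for keeping the highest-scoring hit per query follows, assuming the standard 12-column -outfmt 6 layout (query in column 1, subject in column 2, e-value in column 11, bit score in column 12); `best_hits` is a hypothetical helper and the demo rows are invented.

```shell
#!/usr/bin/env bash
# Sketch: keep the top-scoring hit per query from BLAST tabular output (-outfmt 6).
# best_hits is a hypothetical helper, not part of BLAST+.
best_hits() {
  tab=$(printf '\t')
  # Sort by query name, then by bit score (column 12) from high to low,
  # and let awk print only the first row seen for each query.
  LC_ALL=C sort -t "$tab" -k1,1 -k12,12nr "$1" |
    awk -F'\t' 'BEGIN { OFS = "\t" } !seen[$1]++ { print $1, $2, $11, $12 }'
}

# Demo with three made-up hits for two queries.
demo_blast=$(mktemp)
{
  printf 'q1\tsp|A\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-30\t200\n'
  printf 'q1\tsp|B\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-90\t350\n'
  printf 'q2\tsp|C\t40.0\t80\t48\t0\t1\t80\t1\t80\t2e-10\t80\n'
} > "$demo_blast"
best_hits "$demo_blast"
rm -f "$demo_blast"
```

On real output the call would be `best_hits blast_swissprot.txt > best_hits.tsv`. The explicit sort is a design choice: it avoids relying on the order in which BLAST happened to write the rows.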


6 Methylation Calling (Bonus)

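PacBio HiFi and ONT reads can carry base modification information (MM/ML tags in the BAM), provided the instrument software or basecaller was run with modification calling enabled. Extract 5mC methylation from the aligned reads: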

Bash

# PacBio: pb-CpG-tools
aligned_bam_to_cpg_scores --bam aligned.hifi.bam \
  --output-prefix cpg_scores \
  --model pileup_calling_model.v1.tflite \
  --threads 16

# ONT: modkit (from Oxford Nanopore)
modkit pileup aligned.ont.bam methylation.bed \
  --ref assembly.fasta \
  --threads 16
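A single genome-wide number is a useful first check on the pileup output. The sketch below computes coverage-weighted methylation from a bedMethyl file, assuming the column layout described in the modkit documentation (column 4 = modification code, column 10 = valid coverage, column 12 = modified count); verify this against the version you run. `weighted_methylation` is a hypothetical helper and the demo rows are invented.

```shell
#!/usr/bin/env bash
# Sketch: coverage-weighted methylation from a bedMethyl file.
# Assumed columns: 4 = modification code, 10 = valid coverage, 12 = modified count.
# weighted_methylation is a hypothetical helper, not a modkit subcommand.
weighted_methylation() {
  awk -v code="${2:-m}" '
    {
      split($4, parts, ",")            # column 4 may be "m" or carry extra motif fields
      if (parts[1] != code) next       # keep only the requested modification (m = 5mC)
      sites++
      valid += $10
      modified += $12
    }
    END {
      pct = (valid > 0) ? 100 * modified / valid : 0
      printf "sites=%d weighted_methylation=%.2f%%\n", sites, pct
    }
  ' "$1"
}

# Demo with two made-up 5mC sites and one 5hmC row that is ignored.
demo_bed=$(mktemp)
{
  printf 'chr1\t100\t101\tm\t10\t+\t100\t101\t255,0,0\t10\t80.00\t8\t2\t0\t0\t0\t0\t0\n'
  printf 'chr1\t200\t201\tm\t10\t+\t200\t201\t255,0,0\t10\t20.00\t2\t8\t0\t0\t0\t0\t0\n'
  printf 'chr1\t200\t201\th\t10\t+\t200\t201\t255,0,0\t10\t0.00\t0\t10\t0\t0\t0\t0\t0\n'
} > "$demo_bed"
weighted_methylation "$demo_bed"
rm -f "$demo_bed"
```

Weighting by coverage keeps a handful of low-depth sites from skewing the average, which a plain mean of the per-site percentages would not do.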

7 AI Prompt Guide

Annotation Pipeline for Your Genome

I have a [ORGANISM] genome assembly ([SIZE] Mb, [CONTIGS] contigs, BUSCO C=[X]%). I have [RNA-Seq BAM / assembled transcripts / neither] as evidence. I have [protein sequences from related species / OrthoDB / neither]. Please write a complete annotation pipeline that:

1. Masks repeats with RepeatModeler + RepeatMasker
2. Runs BRAKER with my available evidence
3. Runs MAKER to refine gene models
4. Performs functional annotation with InterProScan + eggNOG-mapper
5. Filters final genes by AED < 0.5
6. Outputs gene count and average gene length statistics
7. Wraps everything in SLURM scripts for [CORES] CPUs, [RAM] GB RAM

8 Common Errors & Troubleshooting

MAKER: "Died at..." or hangs

Check maker_opts.ctl paths — MAKER silently fails on bad file paths. Use absolute paths. Also check that Augustus species is trained (augustus --species=help).
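Because bad paths fail silently, it can save a long wait to verify them before launching MAKER. The following is a minimal sketch that reads the file-path settings mentioned in this guide (genome, est, protein, rmlib) from a control file and reports any that are missing or empty. `check_maker_paths` is a hypothetical helper, and the sketch assumes plain comma-separated paths without spaces.

```shell
#!/usr/bin/env bash
# Sketch: report evidence files named in maker_opts.ctl that are missing or empty.
# check_maker_paths is a hypothetical helper, not a MAKER script.
check_maker_paths() {
  missing=$(
    awk -F= '
      /^(genome|est|protein|rmlib)=/ {
        value = $2
        sub(/#.*/, "", value)            # drop the trailing comment
        gsub(/[ \t]/, "", value)         # drop padding (assumes no spaces in paths)
        n = split(value, paths, ",")     # some settings hold comma-separated lists
        for (i = 1; i <= n; i++)
          if (paths[i] != "") print $1, paths[i]
      }
    ' "$1" |
    while read -r key path; do
      [ -s "$path" ] || printf '%s=%s\n' "$key" "$path"
    done
  )
  if [ -n "$missing" ]; then
    printf 'missing or empty files:\n%s\n' "$missing"
    return 1
  fi
  echo "all evidence paths exist"
}

# Demo: a control file whose protein path does not exist.
demo_dir=$(mktemp -d)
printf '>chr1\nACGT\n' > "$demo_dir/assembly.fasta"
printf 'genome=%s/assembly.fasta #genome sequence\nprotein=%s/proteins.fasta #protein evidence\nest= #no transcripts\n' \
  "$demo_dir" "$demo_dir" > "$demo_dir/maker_opts.ctl"
check_maker_paths "$demo_dir/maker_opts.ctl" || echo "fix the paths above before launching MAKER"
rm -rf "$demo_dir"
```

Relative paths are checked against the current directory, so run the check from the directory where MAKER will be started, or use absolute paths as recommended above.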

BRAKER: "No training genes found"

Your RNA-Seq BAM may have too few aligned reads. Check alignment rate with samtools flagstat. Ensure BAM is sorted and indexed. Try protein mode instead.

RepeatModeler: Takes days to finish

Normal for large genomes (>1 Gb). Use -pa 32 for more parallelism. For a quick alternative, use RepeatMasker with a curated library only (skip RepeatModeler).

Too many predicted genes (>2× expected)

Likely unmasked repeats or transposon fragments being called as genes. Re-run repeat masking more thoroughly, or filter by AED and protein evidence.
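The ">2× expected" check is easy to automate once a GFF3 exists. A minimal sketch follows, assuming gene features are labelled `gene` in column 3; `gene_stats` is a hypothetical helper, the expected count is whatever you supply for your organism, and the demo records are invented.

```shell
#!/usr/bin/env bash
# Sketch: gene count and mean gene length from a GFF3, with an optional
# warning when the count exceeds twice an expected value.
# gene_stats is a hypothetical helper, not part of MAKER or BRAKER.
gene_stats() {
  awk -F'\t' -v expected="${2:-0}" '
    $3 == "gene" { n++; bases += $5 - $4 + 1 }   # GFF3 coordinates are 1-based, inclusive
    END {
      mean = (n > 0) ? bases / n : 0
      printf "genes=%d mean_length=%.0f\n", n, mean
      if (expected + 0 > 0 && n > 2 * expected)
        print "WARNING: more than 2x the expected gene count, check repeat masking"
    }
  ' "$1"
}

# Demo with three made-up genes of 1000, 2000 and 3000 bp.
demo_gff=$(mktemp)
{
  printf 'chr1\tmaker\tgene\t1\t1000\t.\t+\t.\tID=g1\n'
  printf 'chr1\tmaker\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1\n'
  printf 'chr1\tmaker\tgene\t5001\t7000\t.\t-\t.\tID=g2\n'
  printf 'chr1\tmaker\tgene\t10001\t13000\t.\t+\t.\tID=g3\n'
} > "$demo_gff"
gene_stats "$demo_gff"
rm -f "$demo_gff"
```

A call such as `gene_stats my_annotation.all.gff 30000` would flag a plant annotation with more than 60,000 genes. An unusually short mean gene length alongside a high count is a further hint that transposon fragments are being called as genes.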

9 Organism-Specific Notes

Plants

Repeats often make up 70–85% of large plant genomes, so repeat masking is critical
Polyploid genomes: expect 2×–6× gene count vs. diploid relatives
Use Viridiplantae OrthoDB proteins for BRAKER
Alternative: Helixer (deep-learning gene predictor, good for plants)

Animals

More introns and longer genes than plants — gene finders need proper training
Use Metazoa OrthoDB proteins
Consider NCBI's PGAP for submission-ready annotation

Microbes

Use Prokka or Bakta — much simpler than MAKER/BRAKER
Prokaryotes: no introns, high gene density, minimal repeats
Fungi: use Funannotate — designed specifically for fungi

10 All Annotation Tools

Tool	Description	Category
MAKER	Evidence-based pipeline (eukaryotes)	Eukaryote
BRAKER2/3	Self-training Augustus+GeneMark	Eukaryote
Helixer	Deep learning gene predictor (GPU)	Deep learning
GALBA	Miniprot + Augustus (protein only)	Protein
Funannotate	Fungi-specific annotation pipeline	Fungi
Prokka	Fast prokaryotic annotation	Bacteria
Bakta	Modern Prokka replacement, better DB	Bacteria
PGAP	NCBI's official annotation pipeline	Submission
GeMoMa	Homology-based gene model transfer	Homology
TSEBRA	Combine BRAKER + other predictions	Combiner
InterProScan	Domain, GO, pathway annotation	Functional
eggNOG-mapper	Ortholog-based functional annotation	Functional

Mnemonics & Thumb Rules

Mnemonic

"Repeat → Train → Annotate → Function" — RTAF

RepeatMasker → Train with BRAKER → Annotate with MAKER → Functional with InterProScan. This order is mandatory.

Thumb Rule

Expected gene counts

Plants: 25,000–40,000 genes. Mammals: 20,000–25,000. Insects: 13,000–18,000. Bacteria: 3,000–6,000. If you get 2× these numbers, suspect incomplete repeat masking first.

Thumb Rule (AED filter): Start with AED < 0.5 (includes ~80% of real genes). For a high-confidence subset, tighten to AED < 0.3 or lower. For submission, manually curate genes with AED > 0.7.

Summary

You can now:

Build custom repeat libraries and mask genomes
Train gene predictors with BRAKER using RNA-Seq or protein evidence
Run MAKER for evidence-based consensus annotation
Add functional annotations with InterProScan and eggNOG
Filter and evaluate gene models by AED score
