Repeat masking, gene prediction, evidence-based annotation with MAKER & BRAKER2, and functional annotation with InterProScan & eggNOG-mapper.
Genome annotation identifies genes, regulatory elements, and repeats within an assembled genome. The standard workflow:
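# Step 1: Build a custom repeat library
# (recent RepeatModeler releases use -threads in place of -pa)
BuildDatabase -name mygenome assembly.fasta
RepeatModeler -database mygenome -pa 16 -LTRStruct
# Step 2: Mask the genome (-xsmall soft-masks repeats as lowercase)
RepeatMasker -pa 16 -gff -xsmall -lib mygenome-families.fa assembly.fasta
# Output files:
# assembly.fasta.masked: soft-masked sequence (lowercase because of -xsmall; without it repeats are replaced by Ns)
# assembly.fasta.out: detailed repeat table
# assembly.fasta.out.gff: repeat coordinates
# assembly.fasta.tbl: summary statistics
# The later commands refer to the masked file as assembly.masked.fasta, so rename or symlink it.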
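Before moving on, it can help to confirm that masking actually happened. The following is a minimal sketch, not part of RepeatMasker: the function name `masked_fraction` is hypothetical, and it assumes a plain (uncompressed) FASTA in which soft-masked bases are lowercase.

```shell
# masked_fraction FASTA
# Prints the fraction of bases that are masked, to four decimal places.
masked_fraction() {
  awk '
    /^>/ { next }                       # skip header lines
    {
      total += length($0)
      line = $0
      soft += gsub(/[acgt]/, "", line)  # lowercase bases = soft-masked
      hard += gsub(/[Nn]/, "", line)    # N or n = hard-masked bases or assembly gaps
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", (soft + hard) / total
    }
  ' "$1"
}
```

A typical call would be `masked_fraction assembly.fasta.masked`. A value near zero suggests that the repeat library was not applied.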
"Mask Before Predicting" — MBP
Always mask repeats → build the training set → predict genes. Skipping masking typically produces thousands of false-positive gene calls.
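BRAKER automatically trains Augustus and GeneMark using RNA-Seq and/or protein evidence: GeneMark-ET with RNA-Seq alone, GeneMark-EP+ with proteins alone, and GeneMark-ETP when both are supplied.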
# BRAKER with RNA-Seq evidence (preferred)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --softmasking \
  --cores 16 \
  --species=my_species
# BRAKER with protein evidence only
braker.pl --genome=assembly.masked.fasta \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species
# BRAKER3: combines RNA-Seq + proteins (best)
braker.pl --genome=assembly.masked.fasta \
  --bam=rnaseq_aligned.bam \
  --prot_seq=proteins.fasta \
  --softmasking \
  --cores 16 \
  --species=my_species
| Evidence Type | BRAKER Mode | Strengths and Limitations |
|---|---|---|
| RNA-Seq only | BRAKER1 mode | Good for expressed genes; misses genes not expressed in the sampled tissues |
| Protein only | BRAKER2 mode | Good coverage; may miss novel genes |
| RNA-Seq + Protein | BRAKER3 mode | Best overall; use when both are available |
For protein evidence, use the OrthoDB partition that matches your clade: Viridiplantae for plants, Metazoa for animals. Both are available from the BRAKER datasets page.
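The choice of mode follows directly from which evidence files you have. As a rough sketch (the helper name `braker_evidence_args` is hypothetical and not part of BRAKER), the evidence flags can be assembled from whichever files exist and are non-empty:

```shell
# braker_evidence_args BAM PROTEINS
# Prints the braker.pl evidence flags for the files that exist and are non-empty.
# Returns 1 if neither file is usable.
braker_evidence_args() {
  local bam="$1" prot="$2" args=""
  [ -s "$bam" ] && args="--bam=$bam"
  [ -s "$prot" ] && args="${args:+$args }--prot_seq=$prot"
  if [ -z "$args" ]; then
    echo "no RNA-Seq BAM or protein FASTA found" >&2
    return 1
  fi
  printf '%s\n' "$args"
}
```

The output can then be spliced into the commands shown earlier, for example `braker.pl --genome=assembly.masked.fasta $(braker_evidence_args rnaseq_aligned.bam proteins.fasta) --softmasking --cores 16 --species=my_species`.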
MAKER combines ab initio predictions, RNA-Seq alignments, and protein homology into consensus gene models.
# Generate control files
maker -CTL
# Edit: maker_opts.ctl, maker_bopts.ctl, maker_exe.ctl
# Key settings in maker_opts.ctl:
# genome=assembly.masked.fasta
# est=transcripts.fasta (assembled RNA-Seq)
# protein=uniprot_proteins.fasta (related species)
# rmlib=mygenome-families.fa (custom repeat library)
# augustus_species=my_species (trained by BRAKER)
# est2genome=1 (for first pass)
# protein2genome=1 (for first pass)
# Run MAKER (MPI for parallel)
mpiexec -n 16 maker -base my_annotation maker_opts.ctl maker_bopts.ctl maker_exe.ctl
# Extract results
cd my_annotation.maker.output
gff3_merge -d my_annotation_master_datastore_index.log
fasta_merge -d my_annotation_master_datastore_index.log
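# Filter by AED score (Annotation Edit Distance: 0 = full evidence support, 1 = none)
# AED < 0.5 is a common cutoff for well-supported gene models
# Requires gawk (three-argument match); keeps only the lines that carry the AED attribute (mRNA features)
gawk '$0 ~ /AED=/ { match($0, /AED=([0-9.]+)/, a); if (a[1] + 0 < 0.5) print }' \
my_annotation.all.gff > high_quality_genes.gff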
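The one-line filter above returns only the lines that carry an AED value. If you want complete gene models (gene, mRNA, exon, CDS), a two-pass filter is one option. This is a sketch under stated assumptions: the function name `filter_by_aed` is hypothetical, the AED value is assumed to sit in an `_AED=` attribute on mRNA features, and child features are assumed to link to their mRNA through `Parent=`. It uses only POSIX awk features.

```shell
# filter_by_aed GFF THRESHOLD
# Prints header lines plus every gene, mRNA, and child feature whose mRNA has _AED < THRESHOLD.
filter_by_aed() {
  awk -F '\t' -v thr="$2" '
    # Return the value of attribute "key" from a GFF3 column-9 string.
    function attr(s, key,    n, i, parts) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    # Pass 1: remember mRNAs under the threshold and their parent genes.
    NR == FNR {
      if ($3 == "mRNA") {
        aed = attr($9, "_AED")
        if (aed != "" && aed + 0 < thr + 0) {
          id = attr($9, "ID");         if (id != "") keep[id] = 1
          parent = attr($9, "Parent"); if (parent != "") keep[parent] = 1
        }
      }
      next
    }
    # Pass 2: stop at an embedded FASTA section, pass headers through,
    # then print kept features and anything whose Parent was kept.
    /^##FASTA/ { exit }
    /^#/ { print; next }
    {
      id = attr($9, "ID")
      if (id != "" && (id in keep)) { print; next }
      n = split(attr($9, "Parent"), parents, ",")
      for (i = 1; i <= n; i++)
        if (parents[i] in keep) { print; break }
    }
  ' "$1" "$1"
}
```

Example: `filter_by_aed my_annotation.all.gff 0.5 > high_quality_models.gff`. The file is read twice, so it must be a regular file rather than a pipe.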
# InterProScan: domain and GO annotation
# -b sets the output file base; -o is only accepted with a single output format
interproscan.sh -i proteins.fasta \
  -f tsv,gff3,xml \
  -goterms -iprlookup -pa \
  -cpu 16 \
  -b interproscan_results
# eggNOG-mapper: ortholog-based functional annotation
emapper.py -i proteins.fasta \
  --output eggnog_results \
  -m diamond \
  --cpu 16 \
  --go_evidence non-electronic
# BLAST against SwissProt/UniProt
blastp -query proteins.fasta \
  -db swissprot \
  -evalue 1e-5 \
  -max_target_seqs 5 \
  -outfmt 6 \
  -num_threads 16 \
  -out blast_swissprot.txt
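A quick way to judge the functional annotation is to count how many proteins received at least one GO term. The sketch below assumes the InterProScan TSV layout in which column 1 is the protein ID and column 14 holds GO terms (present when `-goterms` is used, `-` when empty); the function name `go_coverage` is hypothetical. Check the column positions against the output of your InterProScan version.

```shell
# go_coverage TSV
# Prints "<proteins with a GO term><TAB><proteins with any InterProScan hit>".
go_coverage() {
  awk -F '\t' '
    { seen[$1] = 1 }
    NF >= 14 && $14 != "" && $14 != "-" { with_go[$1] = 1 }
    END {
      for (p in seen) total++
      for (p in with_go) annotated++
      printf "%d\t%d\n", annotated, total
    }
  ' "$1"
}
```

Proteins without any signature match do not appear in the TSV at all, so the second number is not the size of the proteome. Compare it against the sequence count in proteins.fasta if you need an overall rate.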
PacBio HiFi and ONT reads can carry base modification information (stored in the MM/ML BAM tags when modification calling was enabled). Extract 5mC methylation:
# PacBio: pb-CpG-tools
aligned_bam_to_cpg_scores --bam aligned.hifi.bam \
  --output-prefix cpg_scores \
  --model pileup_calling_model.v1.tflite \
  --threads 16
# ONT: modkit (from Oxford Nanopore)
modkit pileup aligned.ont.bam methylation.bed \
  --ref assembly.fasta \
  --threads 16
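To get a single genome-wide number out of the modkit output, you can aggregate the bedMethyl file. This sketch assumes the bedMethyl layout in which column 4 is the modification code (`m` for 5mC), column 10 is the valid coverage, and column 12 is the count of modified calls; verify that against the documentation for your modkit version. The function name `mean_5mc` and the default coverage cutoff of 5 are illustrative choices, not modkit defaults.

```shell
# mean_5mc BEDMETHYL [MIN_COV]
# Prints the coverage-weighted percentage of 5mC calls across sites with
# at least MIN_COV valid reads (default 5), or "NA" if no site qualifies.
mean_5mc() {
  awk -v min_cov="${2:-5}" '
    $4 ~ /^m(,|$)/ && $10 + 0 >= min_cov + 0 {
      valid += $10    # reads with a confident call at this site
      mod   += $12    # reads called as modified
    }
    END {
      if (valid == 0) { print "NA"; exit }
      printf "%.2f\n", 100 * mod / valid
    }
  ' "$1"
}
```

Weighting by coverage keeps low-depth sites from dominating the estimate; an unweighted mean of the per-site percentages would treat a site covered twice the same as one covered fifty times.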
Check the paths in maker_opts.ctl: MAKER can fail silently on bad file paths, so use absolute paths. Also confirm that the Augustus species is trained (augustus --species=help lists the available species).
Your RNA-Seq BAM may have too few aligned reads. Check the alignment rate with samtools flagstat, and make sure the BAM is coordinate-sorted and indexed. If RNA-Seq evidence remains too sparse, run BRAKER in protein mode instead.
Long runtimes are normal for large genomes (>1 Gb). Use -pa 32 for more parallelism. For a quicker alternative, skip RepeatModeler and run RepeatMasker with a curated library only.
The likely cause is unmasked repeats or transposon fragments being called as genes. Re-run repeat masking more thoroughly, or filter gene models by AED and protein evidence.
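The maker_opts.ctl path check mentioned above is easy to automate. The following is a sketch, not a MAKER utility: the function name `check_ctl_paths` is hypothetical, it looks only at the `genome`, `est`, `protein`, and `rmlib` keys used in this tutorial, and it assumes values are plain paths or comma-separated lists of paths with an optional trailing `#` comment.

```shell
# check_ctl_paths CTL_FILE
# Reports relative or missing paths for selected keys; returns 1 if any problem is found.
check_ctl_paths() {
  local ctl="$1" status=0 key value path old_ifs
  for key in genome est protein rmlib; do
    value=$(awk -F '=' -v k="$key" '
      $1 == k {
        sub(/^[^=]*=/, "")            # drop the "key=" prefix
        sub(/[ \t]*#.*$/, "")         # drop a trailing comment
        gsub(/^[ \t]+|[ \t]+$/, "")   # trim surrounding whitespace
        print
        exit
      }' "$ctl")
    [ -z "$value" ] && continue
    old_ifs=$IFS
    IFS=','                           # values may be comma-separated lists
    for path in $value; do
      case "$path" in
        /*) ;;
        *) echo "relative path: $key=$path" >&2; status=1 ;;
      esac
      [ -e "$path" ] || { echo "missing file: $key=$path" >&2; status=1; }
    done
    IFS=$old_ifs
  done
  return "$status"
}
```

Running `check_ctl_paths maker_opts.ctl && mpiexec -n 16 maker -base my_annotation maker_opts.ctl maker_bopts.ctl maker_exe.ctl` would start MAKER only when every listed input is an existing absolute path.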
Viridiplantae OrthoDB proteins for BRAKER
Metazoa OrthoDB proteins
| Description | Category |
|---|---|
| Evidence-based pipeline (eukaryotes) | Eukaryote |
| Self-training Augustus+GeneMark | Eukaryote |
| Deep learning gene predictor (GPU) | DL |
| Miniprot + Augustus (protein only) | Protein |
| Fungi-specific annotation pipeline | Fungi |
| Fast prokaryotic annotation | Bacteria |
| Modern Prokka replacement, better DB | Bacteria |
| NCBI's official annotation pipeline | Submission |
| Homology-based gene model transfer | Homology |
| Combine BRAKER + other predictions | Combiner |
| Domain, GO, pathway annotation | Functional |
| Ortholog-based functional annotation | Functional |
"Repeat → Train → Annotate → Function" (RTAF)
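RepeatMasker → Train with BRAKER → Annotate with MAKER → Functional annotation with InterProScan. Keep this order: each step consumes the output of the one before it (masked genome, trained species parameters, gene models, protein sequences).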
Expected gene counts
Plants: 25,000–40,000 genes. Mammals: 20,000–25,000. Insects: 13,000–18,000. Bacteria: 3,000–6,000. If your count is around 2× these numbers, repeat masking has likely failed.
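These ranges can be turned into a simple sanity check on a finished annotation. The sketch below uses two hypothetical helpers: `count_genes` counts `gene` features in a GFF3 file, and `check_gene_count` compares a count with a range. Reading "2× these numbers" as twice the upper bound of the range is an assumption made here.

```shell
# count_genes GFF
# Prints the number of features whose type (column 3) is "gene".
count_genes() {
  awk -F '\t' '$3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# check_gene_count COUNT MIN MAX
# Prints "ok", "low", "high", or "inflated" (at least twice MAX).
check_gene_count() {
  local count="$1" min="$2" max="$3"
  if [ "$count" -ge $((2 * max)) ]; then
    echo "inflated"
  elif [ "$count" -gt "$max" ]; then
    echo "high"
  elif [ "$count" -lt "$min" ]; then
    echo "low"
  else
    echo "ok"
  fi
}
```

For a plant genome, `check_gene_count "$(count_genes my_annotation.all.gff)" 25000 40000` printing `inflated` would point back to the repeat-masking step.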