Report on Annotation Recommended Tools
Version 1.0—December 2024
To accompany the Recommendations, the EBP provides A Report on Annotation Standards.
The annotation subcommittee has assembled a list of tools that one or more of its members have found useful for particular steps of the annotation process. This list was created in the fall of 2024 and is not meant to be exhaustive. Comments on the scope, parameters or performance of each tool are based on the experience of this committee only. We recommend always using the latest version of a tool.
Repeat masking
RepeatModeler2
For building a de novo library of repeats that can be used for repeat masking. Offers options for additional LTR identification.
TE-Trimmer
Automates the manual curation of TE libraries, including TE boundary definition and classification, to improve library construction.
RepeatMasker
For annotating and masking repetitive elements in genomic sequences.
Tandem Repeats Finder
Integrated in RepeatMasker but can additionally be run with different parameters to mask more repeats in large genomes.
Ultra
For the identification and classification of tandem repeats (including satellites).
RepeatDetector
K-mer based. Works well for vertebrates, insects and plants, but may not work well for some species, fungi in particular.
WindowMasker
For masking low-complexity regions in genomic sequences. K-mer based. Tends to mask less than other tools.
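The comments above on how much each tool masks can be checked on a given genome by measuring the soft-masked fraction of each masked FASTA file. The following Python sketch is an illustration only and is not part of any tool listed here. It assumes soft-masking (masked bases written in lowercase), and the function name soft_masked_fraction is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def soft_masked_fraction(fasta_lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Return {sequence_id: (lowercase_bases, total_bases)} for FASTA lines.

    Assumption: the masker wrote masked bases in lowercase (soft-masking).
    Hard-masked bases (N) are counted in the total but not as masked.
    """
    stats: Dict[str, Tuple[int, int]] = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence identifier.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            stats[current] = (0, 0)
            continue
        if current is None:
            raise ValueError("sequence data found before the first FASTA header")
        masked, total = stats[current]
        masked += sum(1 for base in line if base.islower())
        stats[current] = (masked, total + len(line))
    return stats


# Toy input; for a real genome, pass an open file handle instead of a list.
example = [">chr1 demo", "ACGTacgt", "NNNNacgt", ">chr2", "ACGT"]
for name, (masked, total) in soft_masked_fraction(example).items():
    print(f"{name}\t{masked}\t{total}\t{masked / total:.2%}")
```

Running the same function on the output of two maskers for the same assembly gives a quick per-sequence comparison of how aggressively each one masks.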
Short read alignment
STAR
RNA-Seq aligner that handles spliced alignments, with an option for diploid genomes. Two-pass mapping increases intron discovery.
HISAT2
Fast, RAM-efficient and sensitive RNA-Seq aligner.
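As an illustration of the two-pass mapping mentioned for STAR, the sketch below assembles a STAR command line in Python. It assumes that a genome index has already been built with STAR's genomeGenerate run mode. The function name, the file names, the thread count and the output prefix are placeholders chosen for this example, not recommendations from the subcommittee.

```python
import shlex
from typing import List


def star_two_pass_command(genome_dir: str, reads: List[str],
                          threads: int = 8, out_prefix: str = "star_") -> List[str]:
    """Build (but do not run) a STAR command that uses two-pass mapping."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) read files")
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        # Two-pass mode: splice junctions found in the first pass are
        # used to guide the second pass.
        "--twopassMode", "Basic",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, shown only to illustrate the call.
cmd = star_two_pass_command("star_index", ["sample_R1.fastq", "sample_R2.fastq"])
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.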
Long read/cDNA alignment
Minimap2
Aligner for IsoSeq consensus or ONT reads. Offers options for ONT cDNA and direct RNA.
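The read-type-specific options mentioned for Minimap2 can be sketched as a small lookup table. The option sets below follow the spliced-alignment examples in the minimap2 documentation as we understand them; the dictionary keys, the function name and the file names are assumptions made for this example, and the options should be checked against the installed version.

```python
import shlex
from typing import Dict, List

# Spliced-alignment options per read type (keys are hypothetical labels).
MINIMAP2_SPLICE_OPTIONS: Dict[str, List[str]] = {
    "isoseq": ["-ax", "splice:hq", "-uf"],               # IsoSeq consensus reads
    "ont_cdna": ["-ax", "splice"],                        # ONT cDNA reads
    "ont_direct_rna": ["-ax", "splice", "-uf", "-k14"],   # ONT direct RNA reads
}


def minimap2_command(read_type: str, reference: str, reads: str,
                     threads: int = 8) -> List[str]:
    """Build (but do not run) a minimap2 spliced-alignment command."""
    try:
        options = MINIMAP2_SPLICE_OPTIONS[read_type]
    except KeyError:
        known = ", ".join(sorted(MINIMAP2_SPLICE_OPTIONS))
        raise ValueError(f"unknown read type {read_type!r}; expected one of: {known}")
    # minimap2 writes SAM to standard output when -a is given.
    return ["minimap2", "-t", str(threads), *options, reference, reads]


# Hypothetical file names, shown only to illustrate the call.
print(shlex.join(minimap2_command("ont_direct_rna", "genome.fa", "direct_rna.fastq")))
```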
Transcript reconstruction
Stringtie2
Can handle RNA-Seq and IsoSeq or ONT long reads simultaneously.
Psiclass
For RNA-Seq reads. Can operate with a variable number of libraries, optimised for rare isoform construction.
Scallop
For RNA-Seq reads. Sensitive.
Mikado
Combines and filters transcript assemblies from alternative tools, samples or sequencing technologies
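For the Stringtie2 entry above, assembling short and long reads simultaneously corresponds, to our understanding, to StringTie's --mix option, which expects the short-read alignment file first and the long-read alignment file second. The sketch below only builds the command line. It assumes both BAM files are coordinate-sorted; the function name and file names are placeholders.

```python
import shlex
from typing import List


def stringtie_mix_command(short_bam: str, long_bam: str, out_gtf: str,
                          threads: int = 8) -> List[str]:
    """Build (but do not run) a StringTie command for a mixed assembly."""
    return [
        "stringtie",
        "--mix",             # short-read and long-read alignments together
        "-p", str(threads),
        "-o", out_gtf,
        short_bam,           # first input: short-read alignments
        long_bam,            # second input: long-read alignments
    ]


# Hypothetical file names, shown only to illustrate the call.
print(shlex.join(stringtie_mix_command("rnaseq.sorted.bam", "isoseq.sorted.bam",
                                       "transcripts.gtf")))
```

The resulting GTF is the kind of per-tool transcript assembly that a combiner such as Mikado takes as input.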
Protein-to-genome alignment
miniprot
Extremely fast. Reasonably accurate if the evolutionary distance is not very large (e.g. within mammals).
Spaln
DOI: https://doi.org/10.1093/nar/gks708, https://doi.org/10.1093/bioinformatics/btae517
Useful in case of larger evolutionary distances.
Protein-coding gene prediction
BRAKER3
Use when RNA-Seq data is available
BRAKER2
Use when no RNA-Seq data is available, for genomes <1 Gbp
GALBA
DOI: https://doi.org/10.1186/s12859-023-05449-z
For larger genomes. Quick, useful when annotating closely related species where no RNA-Seq is available
GeMoMa
URL: https://github.com/Jstacs/Jstacs/tree/master/projects/gemoma
Requires proteins and reference annotations from closely related genome/species, optionally uses short read RNA-Seq alignments
EviAnn
Requires proteins and reference annotations from closely related genome/species, optionally uses short read RNA-Seq data
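The comments on BRAKER3, BRAKER2 and GALBA above can be read as a simple decision rule. The sketch below encodes one such reading; it is a simplification of the committee's comments, not a definitive selection procedure, and the function name and the exact 1 Gbp cut-off handling are assumptions made for this example.

```python
ONE_GBP = 1_000_000_000


def suggest_predictor(has_rnaseq: bool, genome_size_bp: int) -> str:
    """Suggest a gene predictor from the comments in this section.

    Simplified reading: RNA-Seq available -> BRAKER3; no RNA-Seq and a
    genome below 1 Gbp -> BRAKER2; no RNA-Seq and a larger genome -> GALBA.
    """
    if genome_size_bp <= 0:
        raise ValueError("genome size must be a positive number of base pairs")
    if has_rnaseq:
        return "BRAKER3"
    if genome_size_bp < ONE_GBP:
        return "BRAKER2"
    return "GALBA"


print(suggest_predictor(has_rnaseq=False, genome_size_bp=2_500_000_000))
```

GeMoMa and EviAnn are left out of this rule because their use depends on the availability of a reference annotation from a closely related species rather than on genome size.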
Protein-coding gene set combiners
TSEBRA
Combines gene sets from BRAKER and GALBA
Protein-coding gene predictors (ab initio)
Helixer
DOI: https://doi.org/10.1093/bioinformatics/btaa1044
Tiberius
Currently supports only vertebrates
Annotation by projection
LiftOff
Projection from “closely” related species
LiftOn
Projection from same or different species
CESAR2
Requires the regions to be pre-aligned. Can work at large distances
CAT
URL: https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit
Good for smaller distances (e.g. primates)
Functional annotation
InterProScan
Prediction of domain architectures
EnTAP
Sequence similarity, gene family, ontology, protein domain, contaminant and HGT detection
FANTASIA
DOI: https://doi.org/10.1093/nargab/lqae078, https://doi.org/10.1101/2024.02.28.582465
GO prediction when used in combination with topGO. Very fast on GPU (e.g. 2 minutes for the proteome of one species).
Non-coding gene prediction
RFAM + Infernal cmsearch
For predicting short non-coding RNAs
tRNAscan-SE
For predicting tRNAs. Its eukaryotic high-confidence filter can be applied to significantly reduce false positives.
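For the RFAM + Infernal cmsearch entry above, hits are typically post-processed from the tabular output written with cmsearch's --tblout option. The sketch below parses that table into genomic intervals. It assumes the default tabular layout (target name, accession, query name, accession, mdl, mdl from, mdl to, seq from, seq to, strand, trunc, pass, gc, bias, score, E-value, inc, description); other output formats have different columns and would need a different parser. The RnaHit type and the function name are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class RnaHit(NamedTuple):
    seqid: str     # target sequence (e.g. a chromosome or scaffold)
    family: str    # query model name (e.g. an Rfam family)
    start: int     # 1-based start, always <= end
    end: int
    strand: str    # "+" or "-"
    score: float   # bit score
    evalue: float


def parse_cmsearch_tblout(lines: Iterable[str],
                          significant_only: bool = True) -> Iterator[RnaHit]:
    """Yield hits from a default-format cmsearch --tblout table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # The description is the last column and may contain spaces.
        fields = line.split(None, 17)
        if len(fields) < 17:
            raise ValueError(f"unexpected number of columns: {line!r}")
        # Column 17 ("inc") is "!" for hits above the inclusion threshold.
        if significant_only and fields[16] != "!":
            continue
        seq_from, seq_to = int(fields[7]), int(fields[8])
        yield RnaHit(
            seqid=fields[0],
            family=fields[2],
            # On the minus strand, "seq from" is larger than "seq to".
            start=min(seq_from, seq_to),
            end=max(seq_from, seq_to),
            strand=fields[9],
            score=float(fields[14]),
            evalue=float(fields[15]),
        )


# Made-up example row in the assumed column layout.
example = [
    "#target name  accession query name  accession mdl ...",
    "scaffold_1  -  tRNA  RF00005  cm  1  71  5200  5130  -  no  1  0.55  0.0  62.8  2.9e-11  !  demo scaffold",
    "scaffold_2  -  5S_rRNA  RF00001  cm  1  119  100  218  +  no  1  0.50  0.0  12.1  0.5  ?  -",
]
for hit in parse_cmsearch_tblout(example):
    print(hit)
```

With significant_only left at its default, only the first example row is reported, because the second row carries "?" in the inc column.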
About the Subcommittee
This Report on Annotation Tools Recommendations was developed by EBP’s Scientific Subcommittee for Annotation.