Question

Gene prediction in the era of long read sequencing data and many reference genomes

10 months ago

William ★ 5.3k

Long-read sequencing is now available and affordable enough that many reference genomes can be created, whether for multiple individuals of the same species or for many different species.

For example, hifiasm can assemble a eukaryotic reference genome with a single command in a few hours to days: https://github.com/chhylp123/hifiasm
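hifiasm writes its contigs in GFA format, while most annotation tools expect FASTA, so a small conversion step usually sits between assembly and gene prediction. The following is a minimal sketch of that step, assuming a plain GFA file whose segment (`S`) records carry the sequence in the third column; the function name `gfa_to_fasta` and the contig name in the example are made up for illustration.

```python
from typing import Iterable, Iterator


def gfa_to_fasta(gfa_lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield FASTA lines for every segment (S) record in a GFA stream."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip header, link and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # GFA allows '*' when the sequence is not stored
        yield f">{name}"
        # Wrap the sequence to a fixed line width.
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Tiny in-memory example (hypothetical contig name and sequence).
example = [
    "H\tVN:Z:1.0\n",
    "S\tptg000001l\tACGTACGT\tLN:i:8\n",
]
for fasta_line in gfa_to_fasta(example, width=4):
    print(fasta_line)
```

For a real assembly, pass an open file handle instead of the `example` list and write the yielded lines to a `.fa` file.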

I am wondering what, now that long-read RNA-seq data is available, is an effective way to create good-enough gene models for these many reference genomes.

The gene models don't need to be perfect; they never will be.

But they should contain genes, transcripts, exons, and CDS features, and be compatible with genome browsers and analysis tools such as AGAT and SnpEff.
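One quick way to screen a predicted annotation against that requirement is to count the feature types in its GFF3 file. The sketch below is only a first-pass check, not a validator: it assumes tab-separated, nine-column GFF3 and assumes transcripts are typed `mRNA` (some tools emit `transcript` instead, so `REQUIRED_TYPES` may need adjusting). All names in it are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Sequence

# Assumed feature types; adjust if the tool emits 'transcript' instead of 'mRNA'.
REQUIRED_TYPES = ("gene", "mRNA", "exon", "CDS")


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Count the values of column 3 (feature type) in GFF3 lines."""
    counts: Counter = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features follow
        if line.startswith("#") or not line.strip():
            continue  # comments, directives and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


def missing_feature_types(counts: Counter,
                          required: Sequence[str] = REQUIRED_TYPES) -> List[str]:
    """Return the required feature types that never occur."""
    return [ftype for ftype in required if counts[ftype] == 0]
```

Calling `missing_feature_types(count_feature_types(open("annotation.gff3")))` on a file (the path is a placeholder) returns an empty list when all four types are present.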

Gene prediction tools from the time when reference genomes took years to make are quite good. But they take weeks to months to run, require many different types of input data, and involve many different commands.

So which gene prediction tools are good enough today, given the many reference genomes and the long-read RNA-seq data mentioned above?

gene-prediction • 517 views

score 1 · Accepted Answer · 2023-06-12

BRAKER3 looks interesting.

By chance, its preprint came online today (the same day this question was posted).

"New eukaryotic genomes are being sequenced at increasing rates. However, the pace of genome annotation, which establishes links between genomic sequence and biological function, is lagging behind. For example, in April 2023 49% of the eukaryotic species with assemblies in GenBank, had no annotation in GenBank. Undertakings such as the Earth BioGenome Project (https://www.earthbiogenome.org), which aims to annotate c.a. 1.5 million eukaryotic species, further require that the annotation pipeline is highly automated and reliable and ideally no manual work for each species is required when genome assembly and RNA-Seq are given"

BRAKER3 is the latest pipeline in the BRAKER suite. It enables the usage of RNA-seq and protein data in a fully automated pipeline to train and predict highly reliable genes with GeneMark-ETP and AUGUSTUS.

"Here we present BRAKER3, a novel genome annotation pipeline for eukaryotic genomes that integrates evidence from transcript reads, homologous proteins and the genome itself. We report significantly improved accuracy for 11 test species. BRAKER3 outperforms its predecessors BRAKER1 and BRAKER2 by a large margin, as well as publicly available pipelines, such as MAKER2, FINDER and Funannotate. The most substantial improvements are observed in species with large and complex genomes. Additionally, BRAKER3 adds a Singularity container to the BRAKER suite, which makes it more user-friendly and easier to install."

https://www.biorxiv.org/content/10.1101/2023.06.10.544449v1

https://github.com/Gaius-Augustus/BRAKER
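To give an idea of what "fully automated" looks like in practice, here is a rough sketch that assembles a BRAKER command line with a genome, aligned RNA-seq reads, and a protein set. The `--genome`, `--bam` and `--prot_seq` options are written as I recall them from the BRAKER README, and the file names are placeholders; check `braker.pl --help` before relying on this, and check the README for what kind of RNA-seq alignments (short vs. long reads) the installed version accepts. The helper `braker_command` is made up for illustration and only builds the command; it does not run anything.

```python
import shlex
from typing import List


def braker_command(genome: str, bam: str, proteins: str) -> List[str]:
    """Build (but do not run) a BRAKER invocation as an argument list."""
    return [
        "braker.pl",
        f"--genome={genome}",      # genome assembly in FASTA format
        f"--bam={bam}",            # RNA-seq reads aligned to that genome
        f"--prot_seq={proteins}",  # protein sequences from related species
    ]


# Placeholder file names; replace with real paths.
cmd = braker_command("genome.fa", "rnaseq.bam", "proteins.fa")
print(shlex.join(cmd))
```

The argument list can be handed to `subprocess.run(cmd, check=True)` once the paths and the BRAKER installation (or its container) are in place.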