Question: Error during initial AUGUSTUS training for gene prediction in a non-model organism

Hi all,

I am trying to train a gene prediction model for a non-model plant species using the Arabidopsis thaliana data set. I am following the steps in this tutorial.

Steps followed so far:

(1) Download the Arabidopsis data, which the tutorial provides as an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz

(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4

NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes; I only want to see which steps need to be executed before I run them on the actual data set.

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome

(4) Make initial training set

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train

This step fails with the following errors:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
    No genbank sequences found.
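
Both "Feature begins after it ends" messages quote a location whose first segment is a single coordinate (9388571 and 1828296) rather than a start..end range. Below is a minimal sketch for listing such locations in the training file. It assumes, without confirmation in this thread, that single-base segments inside join(...) are what the parser rejects. The function name find_single_base_segments is hypothetical, and locations wrapped across several lines are only checked on the line that contains join(.

```shell
# Hypothetical helper: list lines with a join(...) location that contains
# a single-base segment, i.e. a bare coordinate with no ".." range,
# such as join(9388571,9389420..9390450).
# Assumption: these segments trigger "Feature begins after it ends".
find_single_base_segments() {
    # The first grep keeps lines with a join( location (with line numbers).
    # The second grep keeps those where a number sits directly between
    # "(" or "," and "," or ")", so it is not part of a start..end range.
    grep -n 'join(' "$1" | grep -E '[(,][<>]?[0-9]+[,)]'
}

# Usage (file name taken from the post):
#   find_single_base_segments GCF_000001735.3_TAIR10_genomic.gbff.train
```

If this prints the features named in the error messages, that would support the assumption. It does not by itself explain the "CDS contains character c" message.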

Question:

I am only running the demo data set, which I expected to run without any issues. The message "CDS contains character c" is especially confusing. Any clues?

EDIT 1: The GenBank files do contain sequences:

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
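
These counts are consistent with the numeric argument of randomSplit.pl being the size of the test set: of the 7 LOCUS records, 4 went to .test and the remaining 3 to .train. A small sketch of that arithmetic follows; the function name split_sizes is hypothetical.

```shell
# Hypothetical helper: given the total number of LOCUS records and the
# number passed to randomSplit.pl (taken here to be the test-set size),
# print how many records end up in each output file.
split_sizes() {
    total=$1
    test_size=$2
    echo "test=${test_size} train=$((total - test_size))"
}

split_sizes 7 4   # prints: test=4 train=3
```

Under that reading, etraining in step (4) received only 3 records, and a real run would pass the desired test-set size and leave the rest for training.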
2.6 years ago • Vijay Lakhujani • updated 15 months ago by smrutimayipanda

Hi,

I am having the same problem. Did you figure out how to solve it?

Thank you so much in advance,

Cristina Osuna

2.5 years ago • cristina.osuna.cruz

Hi Cristina

No, the problem remains the same. What is your organism? What files do you have?

~Vijay

2.5 years ago • Vijay Lakhujani

Hi, I am getting the same problem. Could you please help me out if you have solved it?

16 months ago • smrutimayipanda

Unfortunately, I could not.

16 months ago • Vijay Lakhujani

I ran the AUGUSTUS training a little differently, and it is working now. Thanks!

15 months ago • smrutimayipanda

Have you seen this: https://github.com/tseemann/prokka/issues/32

16 months ago • bowwow

No, I haven't checked.

15 months ago • smrutimayipanda
