Hi, has anyone here trained AUGUSTUS? I'm mainly interested in the training quality, as estimated on a test gene set. My pipeline in R for choosing the training set is as follows (I use the GFF from GenBank):
1) From the GFF, choose all genes for which a product is defined. No hypothetical or predicted proteins.
2) Remove all alternative transcripts.
3) Remove exon-less genes.
4) Check for mRNA overlaps (after adding 1000 bp flanks) and get rid of overlapping genes.
5) Finally, I decided to keep only genes with annotated UTRs (>30 bp), as I got better results that way. I create the UTR features in the GFF myself.
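For illustration, step 4 (the overlap check with flanks) could be sketched roughly as below. This is not the R code from the post; the function name `drop_overlapping`, the tuple layout, and the choice to pad both genes by the flank are assumptions. Coordinates are taken as 1-based and inclusive, as in GFF.

```python
from collections import defaultdict

def drop_overlapping(genes, flank=1000):
    """Return ids of genes whose flank-padded span overlaps no other gene.

    genes: list of (gene_id, seqid, start, end), 1-based inclusive coordinates.
    Assumption: every gene is padded by `flank` on both sides before comparing.
    """
    by_seq = defaultdict(list)
    for gene_id, seqid, start, end in genes:
        by_seq[seqid].append((max(1, start - flank), end + flank, gene_id))

    keep = []
    for spans in by_seq.values():
        spans.sort()  # sort by padded start
        bad = set()
        far_end, far_id = None, None  # span reaching furthest right so far
        for s, e, gid in spans:
            if far_end is not None and s <= far_end:
                # overlaps the furthest-reaching earlier span: drop both
                bad.add(gid)
                bad.add(far_id)
            if far_end is None or e > far_end:
                far_end, far_id = e, gid
        keep.extend(gid for _, _, gid in spans if gid not in bad)
    return keep

# Hypothetical toy input, not real annotation data
example = [
    ("g1", "chr1", 1000, 2000),
    ("g2", "chr1", 2500, 3500),    # within 1000 bp of g1, so both are dropped
    ("g3", "chr1", 10000, 11000),
    ("g4", "chr2", 1000, 2000),
]
kept = drop_overlapping(example)
```

Note that padding both genes means two genes are dropped when they lie closer than twice the flank; pad only one side of the comparison if a single 1000 bp gap is what you want.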
The resulting GFF table has ~500 genes with CDS and UTR features. I convert it to GenBank format with the AUGUSTUS script and split it into ~350 training genes and a test set. After etraining and checking on the test set, the best result I've got for gene prediction is about 0.5. Optimizing doesn't help much. The tutorial suggested setting "excludestopocodon..." to TRUE in bug_parameters.cfg, which in my case makes the training quality even worse.
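To make the "about 0.5" figure concrete: gene-level accuracy is usually reported as sensitivity (correctly predicted genes / annotated genes) and specificity (correctly predicted genes / predicted genes), where a gene counts as correct only if all its coding exons match exactly. A minimal sketch of that calculation, assuming one transcript per gene and a hypothetical input layout (this is not AUGUSTUS's own evaluation code):

```python
def gene_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the gene level.

    annotated, predicted: dict gene_id -> list of (seqid, strand, start, end)
    CDS exons. A predicted gene is correct only if its full exon set is
    identical to the exon set of some annotated gene.
    """
    ann = {frozenset(exons) for exons in annotated.values()}
    pred = {frozenset(exons) for exons in predicted.values()}
    correct = len(ann & pred)
    sensitivity = correct / len(ann) if ann else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity

# Hypothetical toy data: one of two genes is predicted exactly
annotated = {
    "a1": [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)],
    "a2": [("chr1", "-", 900, 1200)],
}
predicted = {
    "p1": [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)],
    "p2": [("chr1", "-", 900, 1150)],  # one boundary off, so not counted
}
sens, spec = gene_level_accuracy(annotated, predicted)
```

Because a single wrong exon boundary makes the whole gene count as wrong, gene-level values are always lower than exon-level ones on the same test set.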
So my main questions are: what gene/exon/UTR prediction quality do you get? Should it be this low? Do you see any flaw in my pipeline, and what would you suggest?
Thanks!
Nastya
What type of genome are you annotating? One important step is to remove redundancy from your gene set: check that no gene in the set shares more than ~85% similarity with another one, as that could bias your training. I ran a test on the optimal number of genes for a good Augustus training, and it was between 500 and 750, so your number of genes is a bit low...
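A rough sketch of such a redundancy filter, assuming the pairwise similarities have already been computed elsewhere (for example from an all-vs-all protein comparison). The function name, the input layout, and the greedy "keep the first one seen" strategy are all assumptions for illustration:

```python
def filter_redundant(gene_ids, pairwise_identity, threshold=85.0):
    """Greedily keep genes that are not too similar to an already kept gene.

    gene_ids: genes in order of preference (earlier ones win).
    pairwise_identity: iterable of (id_a, id_b, percent_identity).
    threshold: pairs above this percent identity are treated as redundant.
    """
    similar = {}
    for a, b, pct in pairwise_identity:
        if a != b and pct > threshold:
            similar.setdefault(a, set()).add(b)
            similar.setdefault(b, set()).add(a)

    kept, kept_set = [], set()
    for gid in gene_ids:
        # keep gid only if none of its close relatives was kept before it
        if similar.get(gid, set()).isdisjoint(kept_set):
            kept.append(gid)
            kept_set.add(gid)
    return kept

# Hypothetical toy data
ids = ["g1", "g2", "g3", "g4"]
pairs = [("g1", "g2", 92.0), ("g2", "g3", 90.0), ("g3", "g4", 40.0)]
nonredundant = filter_redundant(ids, pairs)
```

Here g2 is dropped because it is too close to g1, while g3 stays because its only close relative (g2) was not kept.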
Thanks. It's a diatom; I'm training on the Fragilariopsis cylindrus genome. You're right, I did not check redundancy. I should try that.