Biostar Beta. Not for public use.
Gff to genbank - feature is missing
0
Entering edit mode
2.5 years ago
rororo • 0
@rororo37162

I want to predict genes from my organism of interest using augustus. I therefore have my assembly, as well as a gff3 file. I want to train augustus using my dataset using etraining. I therefore need a genbank file.

I found link with the following script to convert gff+fasta to genbank, since the provided gff2Smallgb.pl from augustus does not work:

#!/usr/bin/env python
"""Convert a GFF and associated FASTA file into GenBank format.

Usage:
gff_to_genbank.py <GFF annotation="" file=""> <FASTA sequence="" file="">
"""
from __future__ import print_function

import sys
import os

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio import Seq

from BCBio import GFF

def main(gff_file, fasta_file):
out_file = "%s.gb" % os.path.splitext(gff_file)[0]
fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
gff_iter = GFF.parse(gff_file, fasta_input)
SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "genbank")

def _fix_ncbi_id(fasta_iter):
"""GenBank identifiers can only be 16 characters; try to shorten NCBI.
"""
for rec in fasta_iter:
if lenrec.name) > 16 and rec.name.find("|") > 0:
new_id = [x for x in rec.name.split("|") if x][-1]
print("Warning: shortening NCBI name %s to %s" % rec.id, new_id))
rec.id = new_id
rec.name = new_id
yield rec

def _check_gff(gff_iterator):
"""Check GFF files before feeding to SeqIO to be sure they have sequences.
"""
for rec in gff_iterator:
if isinstance(rec.seq, Seq.UnknownSeq):
print("Warning: FASTA sequence not found for '%s' in GFF file" % (
rec.id))
rec.seq.alphabet = generic_dna
yield _flatten_features(rec)

def _flatten_features(rec):
"""Make sub_features in an input rec flat for output.

GenBank does not handle nested features, so we want to make
everything top level.
"""
out = []
for f in rec.features:
cur = [f]
while len(cur) > 0:
nextf = []
for curf in cur:
out.append(curf)
if len(curf.sub_features) > 0:
nextf.extend(curf.sub_features)
cur = nextf
rec.features = out
return rec

if __name__ == "__main__":
main(*sys.argv[1:])


Unfortunately, the feature source is not printed, so that augustus etraining logs the following error:

GBProcessor::getGeneList(): Sequence has 0 length. Maybe 'source' Feature missing?


A model genbank file is available here: NCBI genbank format

gff3 genbank DNA augustus bioinformatics • 670 views
0
Entering edit mode

Isn't the source field originating in GFF file? GFF format says this:

source - name of the program that generated this feature, or the data source (database or project name)

So try something you like as source name in your GFF.

0
Entering edit mode

source in the FEATURES table

FEATURES             Location/Qualifiers
source          1..5028
/organism="Saccharomyces cerevisiae"
/db_xref="taxon:4932"
/chromosome="IX"
/map="9"
CDS             <1..206

0
Entering edit mode

Then you will need to create that yourself. If you look in the model genbank page you linked above they provide some indication of what can go in the SOURCE section.

3.4.10 SOURCE Format

The SOURCE field consists of two parts. The first part is found after
the SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period.
The second part consists of information found after the ORGANISM
subkeyword. The formal scientific name for the source organism (genus
and species, where appropriate) is found on the same line as ORGANISM.
The records following the ORGANISM line list the taxonomic
classification levels, separated by semicolons and ending with a
period.

0
Entering edit mode

I have 25,000 entries. So I need an automatic way!

0
Entering edit mode

If you wrote the script above then you should be able to do that, correct?

0
Entering edit mode

sorry, I clearified the source!

0
Entering edit mode

seqret can be used to convert gff+fasta to Genbank?

seqret -sequence genome.fasta -feature -fformat gff -fopenfile annotation.gff -osformat genbank


Another alternative would be to open fasta file in Artemis -> Read GFF3 as an entry within it and save the whole thing as genbank file.

0
Entering edit mode

thanks for your help. Unfortunately, seqret does not stop converting - the file just gets bigger and bigger. Artemis freezes when opening the gff-file

0
Entering edit mode
0
Entering edit mode

0
Entering edit mode

I see! Thanks genomax. @rororo have you tried the conversion file provided with Augustus as mentioned in the thread above? https://github.com/nextgenusfs/augustus/blob/master/scripts/gff2gbSmallDNA.pl

0
Entering edit mode

since the provided gff2Smallgb.pl from augustus does not work

Not sure if that is the same script but this is in the original thread.

0
Entering edit mode

Not sure if that is the same script but this is in the original thread.

I think the original script is: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py , not sure if OP has tried the script provided with Augustus.

0
Entering edit mode

this is the script I posted above

0
Entering edit mode

Unfortunately, seqret does not stop converting - the file just gets bigger and bigger.

If you have 25,000 entries then isn't that expected? You need to be patient at times. It may take some time to complete the conversion.

0
Entering edit mode

as soon as the 25,000 entries are read and returned, seqret starts from the beginning and returns the same sequences again and again