Biostar Beta. Not for public use.
Gff to genbank - feature is missing
0
Entering edit mode
2.5 years ago
rororo • 0
@rororo37162

I want to predict genes from my organism of interest using augustus. I therefore have my assembly, as well as a gff3 file. I want to train augustus using my dataset using etraining. I therefore need a genbank file.

I found link with the following script to convert gff+fasta to genbank, since the provided gff2Smallgb.pl from augustus does not work:

#!/usr/bin/env python
"""Convert a GFF and associated FASTA file into GenBank format.

Usage:
    gff_to_genbank.py <GFF annotation="" file=""> <FASTA sequence="" file="">
"""
from __future__ import print_function

import sys
import os

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio import Seq

from BCBio import GFF

def main(gff_file, fasta_file):
    out_file = "%s.gb" % os.path.splitext(gff_file)[0]
    fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
    gff_iter = GFF.parse(gff_file, fasta_input)
    SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "genbank")

def _fix_ncbi_id(fasta_iter):
    """GenBank identifiers can only be 16 characters; try to shorten NCBI.
    """
    for rec in fasta_iter:
        if lenrec.name) > 16 and rec.name.find("|") > 0:
            new_id = [x for x in rec.name.split("|") if x][-1]
            print("Warning: shortening NCBI name %s to %s" % rec.id, new_id))
            rec.id = new_id
            rec.name = new_id
        yield rec

def _check_gff(gff_iterator):
    """Check GFF files before feeding to SeqIO to be sure they have sequences.
    """
    for rec in gff_iterator:
        if isinstance(rec.seq, Seq.UnknownSeq):
            print("Warning: FASTA sequence not found for '%s' in GFF file" % (
                    rec.id))
            rec.seq.alphabet = generic_dna
        yield _flatten_features(rec)

def _flatten_features(rec):
    """Make sub_features in an input rec flat for output.

    GenBank does not handle nested features, so we want to make
    everything top level.
    """
    out = []
    for f in rec.features:
        cur = [f]
        while len(cur) > 0:
            nextf = []
            for curf in cur:
                out.append(curf)
                if len(curf.sub_features) > 0:
                    nextf.extend(curf.sub_features)
            cur = nextf
    rec.features = out
    return rec

if __name__ == "__main__":
    main(*sys.argv[1:])

Unfortunately, the feature source is not printed, so that augustus etraining logs the following error:

GBProcessor::getGeneList(): Sequence has 0 length. Maybe 'source' Feature missing?

A model genbank file is available here: NCBI genbank format

gff3 genbank DNA augustus bioinformatics • 670 views
ADD COMMENTlink
0
Entering edit mode

Isn't the source field originating in GFF file? GFF format says this:

source - name of the program that generated this feature, or the data source (database or project name)

So try something you like as source name in your GFF.

ADD REPLYlink
0
Entering edit mode

source in the FEATURES table

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
ADD REPLYlink
0
Entering edit mode

Then you will need to create that yourself. If you look in the model genbank page you linked above they provide some indication of what can go in the SOURCE section.

3.4.10 SOURCE Format

  The SOURCE field consists of two parts. The first part is found after
the SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period.
The second part consists of information found after the ORGANISM
subkeyword. The formal scientific name for the source organism (genus
and species, where appropriate) is found on the same line as ORGANISM.
The records following the ORGANISM line list the taxonomic
classification levels, separated by semicolons and ending with a
period.
ADD REPLYlink
0
Entering edit mode

I have 25,000 entries. So I need an automatic way!

ADD REPLYlink
0
Entering edit mode

If you wrote the script above then you should be able to do that, correct?

ADD REPLYlink
0
Entering edit mode

sorry, I clearified the source!

ADD REPLYlink
0
Entering edit mode

seqret can be used to convert gff+fasta to Genbank?

seqret -sequence genome.fasta -feature -fformat gff -fopenfile annotation.gff -osformat genbank

Another alternative would be to open fasta file in Artemis -> Read GFF3 as an entry within it and save the whole thing as genbank file.

ADD REPLYlink
0
Entering edit mode

thanks for your help. Unfortunately, seqret does not stop converting - the file just gets bigger and bigger. Artemis freezes when opening the gff-file

ADD REPLYlink
0
Entering edit mode

Script OP has referenced to came from a link in an answer from this thread.

ADD REPLYlink
0
Entering edit mode

I see! Thanks genomax. @rororo have you tried the conversion file provided with Augustus as mentioned in the thread above? https://github.com/nextgenusfs/augustus/blob/master/scripts/gff2gbSmallDNA.pl

ADD REPLYlink
0
Entering edit mode

since the provided gff2Smallgb.pl from augustus does not work

Not sure if that is the same script but this is in the original thread.

ADD REPLYlink
0
Entering edit mode

Not sure if that is the same script but this is in the original thread.

I think the original script is: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py , not sure if OP has tried the script provided with Augustus.

ADD REPLYlink
0
Entering edit mode

this is the script I posted above

ADD REPLYlink
0
Entering edit mode

Unfortunately, seqret does not stop converting - the file just gets bigger and bigger.

If you have 25,000 entries then isn't that expected? You need to be patient at times. It may take some time to complete the conversion.

ADD REPLYlink
0
Entering edit mode

as soon as the 25,000 entries are read and returned, seqret starts from the beginning and returns the same sequences again and again

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3