Question

Converting Gene Predictions In Glimmer To Protein Sequences.

0

Entering edit mode

10.7 years ago

genomelover ▴ 50

Hi to all, I have used GLIMMER 3.2 to predict genes in the bacterial genomes and I get the ORF ID and not the protein sequences. These sequences have to be used in the another tool and it should be in FASTA FORMAT.So I have used a biopython script to convert gene predictions in GFF3 format to protein sequences.I am using .predict file of GLIMMER which I am not sure as it not exactly in GFF3 format, tried checking it using GFF3 validator and only the first few lines are in GFF3 format, GFF3 format indeed is divided into many kinds which I read many days before I am still not able to figure out which format is the GLIMMER output.The biopython script I am running gives the error. Plz help as I new to the world of bioinformatics,LINUX.The biopython script is here

#!/usr/bin/env python
"""Convert GlimmerHMM GFF3 gene predictions into protein sequences.

This works with the GlimmerHMM GFF3 output format:

##gff-version 3
##sequence-region Contig5.15 1 47390
Contig5.15      GlimmerHMM      mRNA    323     325     .       +       .       ID=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1
Contig5.15      GlimmerHMM      CDS     323     325     .       +       0       ID=Contig5.15.cds1.1;Parent=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1;Note=final-exon

http://www.cbcb.umd.edu/software/GlimmerHMM/

Usage:
    glimmergff_to_proteins.py <glimmer gff3> <ref fasta>
"""
from __future__ import with_statement
import sys
import os
import operator

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

from BCBio import GFF

def main(glimmer_file, ref_file):
    with open(ref_file) as in_handle:
        ref_recs = SeqIO.to_dict(SeqIO.parse(in_handle, "fasta"))

    base, ext = os.path.splitext(glimmer_file)
    out_file = "%s-proteins.fa" % base
    with open(out_file, "w") as out_handle:
        SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")

def protein_recs(glimmer_file, ref_recs):
    """Generate protein records from GlimmerHMM gene predictions.
    """
    with open(glimmer_file) as in_handle:
        for rec in glimmer_predictions(in_handle, ref_recs):
            for feature in rec.features:
                seq_exons = []
                for cds in feature.sub_features:
                    seq_exons.append(rec.seq[
                        cds.location.nofuzzy_start:
                        cds.location.nofuzzy_end])
                gene_seq = reduce(operator.add, seq_exons)
                if feature.strand == -1:
                    gene_seq = gene_seq.reverse_complement()
                protein_seq = gene_seq.translate()
                yield SeqRecord(protein_seq, feature.qualifiers["ID"][0], "", "")

def glimmer_predictions(in_handle, ref_recs):
    """Parse Glimmer output, generating SeqRecord and SeqFeatures for predictions
    """
    for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
        yield rec

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print __doc__
        sys.exit()
    main(*sys.argv[1:])

The error which I get is I have tried running this script many times and I get this error most of the times

Traceback (most recent call last):
   File "glimmergff_to_proteins.py", line 62, in <module>
    main(*sys.argv[1:])
  File "glimmergff_to_proteins.py", line 27, in main
     with open(ref_file) as in_handle:

  Value Error:More than one record found in handle

Every time I run the script I get different errors,The previous error which I think was due to the multiple FASTA file format for the reference file and when the split file was run it shows the same, further splitting also shows the same error.The most recent error which I got was

Traceback (most recent call last):
   File "glimmergff_to_proteins.py", line 62, in <module>
    main(*sys.argv[1:])
  File "glimmergff_to_proteins.py", line 33, in main
    SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
   File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 426, in write
    count = writer_class(fp).write_file(sequences)
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 254, in write_file
     count = self.write_records(records)
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 238, in write_records
    for record in records:
  File "glimmergff_to_proteins.py", line 39, in protein_recs
     for rec in glimmer_predictions(in_handle, ref_recs):
  File "glimmergff_to_proteins.py", line 55, in glimmer_predictions
    for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 712, in parse
     target_lines):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 299, in parse_in_parts
    for results in self.parse_simple(gff_files, limit_info, target_lines):
   File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 320, in parse_simple
    for results in self._gff_process(gff_files, limit_info, target_lines):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 606, in _gff_process
     for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
  File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 637, in _lines_to_out_info
     results = self._map_fn(line, params)
  File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 157, in _gff_line_map
   8w assert len(parts) >= 8, line

Please reply and help in making the script work, have been working on this for months but without any results. Please feel free to mail me (nope: all responses stay on site (md)). My mail id is [deleted]

Here is the link to the script so you can check the line number

script biopython • 6.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 10.7 years ago by genomelover ▴ 50

5

Entering edit mode

Dear genomelover, it's quite bad netiquette to be shouting for help in uppercase letters, and to demand a reply ASAP...

If you wish us to try to help you, please take the time to edit your question to remove shouting, and please do edit your error messages, so that the post is easier to read.

If you also take the time to tell us what you have already tried to get rid of those errors, you will drastically improve your chances of getting a quick reply : http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002202.

ADD REPLY • link 10.7 years ago by Leonor Palmeira 3.9k

2

Entering edit mode

Also, including line numbers in the script above would be helpful, as without them, it's difficult to trace the errors based on the current information? I don't know what line 27 and line 62 are.

ADD REPLY • link 10.7 years ago by Jelena Aleksic ▴ 920

0

Entering edit mode

i have added the link to the script which contains the line number,kindly reply soon

ADD REPLY • link 10.7 years ago by genomelover ▴ 50

0

Entering edit mode

I see it is long time you posted this question. Since I was looking for something similar, I was trying the code you suggested. I did not get any error. The problem is the following and it is often not considered by people who thinks to be good at coding:

gene predictions can be partial. Therefore the frame of the first exon can be different from the 0th (that is the first codon can be partial). It follows that a good gff to sequence script has to take this into account e.g. by trimming the sequence of the first exon before translating it. It seems to me that the script you were using does not take this fact into account. If this is so, it will output completely wrong protein sequences in those cases.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.8 years ago by matteo.brilli • 0